COUNTERFACTUAL CREDIT ASSIGNMENT IN MODEL-FREE REINFORCEMENT LEARNING

Abstract

Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We then propose to use these as future-conditional baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance, which remains unbiased because the hindsight information is constrained to carry no information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.

1. INTRODUCTION

Reinforcement learning (RL) agents act in their environments and learn to achieve desirable outcomes by maximizing a reward signal. A key difficulty is the problem of credit assignment (Minsky, 1961), i.e. understanding the relation between actions and outcomes, and determining to what extent an outcome was caused by external, uncontrollable factors; in other words, determining the share of 'skill' and 'luck'. One possible solution to this problem is for the agent to build a model of the environment, and use it to obtain a more fine-grained understanding of the effects of an action. While this topic has recently generated a lot of interest (Ha & Schmidhuber, 2018; Hamrick, 2019; Kaiser et al., 2019; Schrittwieser et al., 2019), it remains difficult to model complex, partially observed environments.

In contrast, model-free reinforcement learning algorithms such as policy gradient methods (Williams, 1992; Sutton et al., 2000) perform simple time-based credit assignment, where events and rewards happening after an action are credited to that action, post hoc ergo propter hoc. While unbiased in expectation, this coarse-grained credit assignment typically has high variance, and the agent will require a large amount of experience to learn the correct relation between actions and rewards. Another issue with model-free methods is that counterfactual reasoning, i.e. reasoning about what would have happened had different actions been taken with everything else remaining the same, is not possible. Given a trajectory, model-free methods can in fact only learn about the actions that were actually taken to produce the data, and this limits the ability of the agent to learn quickly. As environments grow in complexity due to partial observability, scale, long time horizons, and large numbers of agents, actions taken by the agent will only affect a vanishing part of the outcome, making it increasingly difficult for classical reinforcement learning algorithms to learn.
We need better credit assignment techniques. In this paper, we investigate a new method of credit assignment for model-free reinforcement learning, which we call Counterfactual Credit Assignment (CCA), that leverages hindsight information to implicitly perform counterfactual evaluation: an estimate of the return for actions other than the ones that were chosen. These counterfactual returns can be used to form unbiased and lower-variance estimates of the policy gradient by building future-conditional baselines. Unlike classical Q functions, which also provide an estimate of the return for all actions but do so by averaging over all possible futures, our method provides trajectory-specific counterfactual estimates, i.e. an estimate of the return for different actions, while keeping as many external factors as possible constant between the return and its counterfactual estimate. Our method is inspired by ideas from causality theory, but does not require learning a model of the environment. Our main contributions are: a) proposing a set of environments which further our understanding of when difficult credit assignment leads to poor policy learning; b) introducing new model-free policy gradient algorithms, with sufficient conditions for unbiasedness and guarantees for lower variance. In the appendix, we further c) present a collection of model-based policy gradient algorithms extending previous work on counterfactual policy search; d) connect the literature on causality theory, in particular notions of treatment effects, to concepts from the reinforcement learning literature.

2.1. NOTATION

We use capital letters for random variables and lowercase for the values they take. Consider a generic MDP (X, A, p, r, γ). Given a current state x ∈ X and assuming the agent takes action a ∈ A, the agent receives reward r(x, a) and transitions to a state y ∼ p(·|x, a). The state (resp. action, reward) of the agent at step t is denoted X_t (resp. A_t, R_t). The initial state of the agent X_0 is a fixed x_0. The agent acts according to a policy π, i.e. action A_t is sampled from the policy π_θ(·|X_t), where θ are the parameters of the policy, and aims to optimize the expected discounted return E[G] = E[∑_t γ^t R_t]. The return G_t from step t is G_t = ∑_{t'≥t} γ^{t'−t} R_{t'}. Finally, we define the score function s_θ(π_θ, a, x) = ∇_θ log π_θ(a|x); the score function at time t is denoted S_t = ∇_θ log π_θ(A_t|X_t). In the case of a partially observed environment, we assume the agent receives an observation E_t at every time step, and simply define X_t to be the set of all previous observations, actions and rewards X_t = (O_{≤t}), with O_t = (E_t, A_{t−1}, R_{t−1}).¹ P(X) will denote the probability distribution of a random variable X.
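As a concrete illustration (not part of the paper's contribution), the discounted return G_t = ∑_{t'≥t} γ^{t'−t} R_{t'} can be computed for every step of a trajectory with a single backward pass over the rewards; the function name below is our own:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{t' >= t} gamma^(t' - t) * R_{t'} for each step t,
    using the recursion G_t = R_t + gamma * G_{t+1}."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: rewards [1, 0, 1] with gamma = 0.5:
# G_2 = 1, G_1 = 0 + 0.5 * 1 = 0.5, G_0 = 1 + 0.5 * 0.5 = 1.25
print(discounted_returns([1.0, 0.0, 1.0], 0.5))  # [1.25 0.5  1.  ]
```

The backward recursion avoids the O(T²) cost of evaluating each sum independently.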

2.2. POLICY GRADIENT ALGORITHMS

We begin by recalling two forms of policy gradient algorithms and the credit assignment assumptions they make. The first is the REINFORCE algorithm introduced by Williams (1992), which we will also call the single-action policy gradient estimator:

Proposition 1 (single-action estimator). The gradient of E[G] is given by ∇_θ E[G] = E[∑_{t≥0} γ^t S_t (G_t − V(X_t))], where V(X_t) = E[G_t | X_t].

The appeal of this estimator lies in its simplicity and generality: to evaluate it, the only requirement is the ability to simulate trajectories and to compute both the score function and the return. Let us note two credit assignment features of the estimator. First, the score function S_t is multiplied not by the whole return G, but by the return from time t. Intuitively, action A_t can only affect states and rewards coming after time t, and it is therefore pointless to credit action A_t with past rewards. Second, subtracting the value function V(X_t) from the return G_t does not bias the estimator and typically reduces variance. This estimator updates the policy through the score term; note however that the learning signal only updates the policy π_θ(a|X_t) at the value taken by the action, A_t = a (other values are only updated through normalization).

The policy gradient theorem of Sutton et al. (2000), which we will also call the all-action policy gradient, shows it is possible to provide a learning signal to all actions, provided we have access to a Q-function Q^π(x, a) = E[G_t | X_t = x, A_t = a], which we will call a critic in the following.

Proposition 2 (all-action policy gradient estimator). The gradient of E[G] is given by ∇_θ E[G] = E[∑_t γ^t ∑_a ∇_θ π_θ(a|X_t) Q^{π_θ}(X_t, a)].

A particularity of the all-action policy gradient estimator is that the term at time t for updating the policy, ∑_a ∇_θ π_θ(a|X_t) Q^{π_θ}(X_t, a), depends only on past information; this is in contrast with the score function estimator above, which depends on the return, a function of the entire trajectory.
Proofs can be found in appendix D.1.
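To make the equivalence of the two estimators concrete, the following sketch (our own illustration, with hypothetical function names and a made-up Q-vector) compares them for a single-state softmax policy. With the exact Q-function, averaging the single-action term over actions recovers the all-action gradient exactly:

```python
import numpy as np

n_actions = 3
theta = np.zeros(n_actions)          # softmax policy logits
q_true = np.array([1.0, 2.0, 0.5])   # hypothetical exact Q-values, single state

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def single_action_grad(theta, a, g, baseline):
    """REINFORCE term for one sampled action: S * (G - V)."""
    score = -policy(theta)
    score[a] += 1.0                  # gradient of log-softmax w.r.t. logits
    return score * (g - baseline)

def all_action_grad(theta, q):
    """Policy gradient theorem term: sum_a grad_theta pi(a) * Q(a)."""
    pi = policy(theta)
    jac = np.diag(pi) - np.outer(pi, pi)   # d pi / d logits for a softmax
    return jac @ q

pi = policy(theta)
v = pi @ q_true                      # baseline V = E_pi[Q]
expected_single = sum(pi[a] * single_action_grad(theta, a, q_true[a], v)
                      for a in range(n_actions))
print(np.allclose(expected_single, all_action_grad(theta, q_true)))  # True
```

In practice the single-action estimator only sees one sampled action per step (hence its variance), while the all-action form trades that sampling noise for the bias of an approximate critic.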

2.3. INTUITIVE EXAMPLE ON HINDSIGHT REASONING AND SKILL VERSUS LUCK

Imagine a scenario in which Alice just moved to a new city, is learning to play soccer, and goes to the local soccer field to play a friendly game with a group of other kids she has never met. As the game goes on, Alice does not seem to play at her best and makes some mistakes. It turns out, however, that her teammate Megan is a strong player, and she eventually scores the goal that wins the game. What should Alice learn from this game?



¹ Previous actions and rewards are provided as part of the observation, as it is generally beneficial to do so in partially observable Markov decision processes.

