COUNTERFACTUAL CREDIT ASSIGNMENT IN MODEL-FREE REINFORCEMENT LEARNING

Abstract

Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We then propose to use these as future-conditional baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance, achieving unbiasedness by constraining the hindsight information not to contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.

1. INTRODUCTION

Reinforcement learning (RL) agents act in their environments and learn to achieve desirable outcomes by maximizing a reward signal. A key difficulty is the problem of credit assignment (Minsky, 1961), i.e. understanding the relation between actions and outcomes, and determining to what extent an outcome was caused by external, uncontrollable factors: determining the share of 'skill' and 'luck'. One possible solution to this problem is for the agent to build a model of the environment and use it to obtain a more fine-grained understanding of the effects of an action. While this topic has recently generated a lot of interest (Ha & Schmidhuber, 2018; Hamrick, 2019; Kaiser et al., 2019; Schrittwieser et al., 2019), it remains difficult to model complex, partially observed environments. In contrast, model-free reinforcement learning algorithms such as policy gradient methods (Williams, 1992; Sutton et al., 2000) perform simple time-based credit assignment, where events and rewards happening after an action are credited to that action, post hoc ergo propter hoc. While unbiased in expectation, this coarse-grained credit assignment typically has high variance, and the agent will require a large amount of experience to learn the correct relation between actions and rewards. Another issue with model-free methods is that counterfactual reasoning, i.e. reasoning about what would have happened had different actions been taken with everything else remaining the same, is not possible. Given a trajectory, model-free methods can in fact only learn about the actions that were actually taken to produce the data, and this limits how quickly the agent can learn. As environments grow in complexity due to partial observability, scale, long time horizons, and large numbers of agents, actions taken by the agent will only affect a vanishing part of the outcome, making it increasingly difficult to learn with classical reinforcement learning algorithms.
We need better credit assignment techniques. In this paper, we investigate a new method of credit assignment for model-free reinforcement learning, which we call Counterfactual Credit Assignment (CCA), that leverages hindsight information to implicitly perform counterfactual evaluation: an estimate of the return for actions other than the ones that were chosen. These counterfactual returns can be used to form unbiased, lower-variance estimates of the policy gradient by building future-conditional baselines. Unlike classical Q functions, which also provide an estimate of the return for all actions but do so by averaging over all possible futures, our method provides trajectory-specific counterfactual estimates, i.e. an estimate of the return for different actions that keeps as many of the external factors as possible constant between the return and its counterfactual estimate. Our method is inspired by ideas from causality theory, but does not require learning a model of the environment. Our main contributions are: a) proposing a set of environments which further our understanding of when difficult credit assignment leads to poor policy learning; b) introducing new model-free policy gradient algorithms, with sufficient conditions for unbiasedness and guarantees for lower variance. In the appendix, we further c) present a collection of model-based policy gradient algorithms extending previous work on counterfactual policy search; d) connect the literature about causality theory, in particular notions of treatment effects, to concepts from the reinforcement learning literature.

2.1. NOTATION

We use capital letters for random variables and lowercase for the values they take. Consider a generic MDP (X, A, p, r, γ). Given a current state x ∈ X and assuming the agent takes action a ∈ A, the agent receives reward r(x, a) and transitions to a state y ∼ p(·|x, a). The state (resp. action, reward) of the agent at step t is denoted X_t (resp. A_t, R_t). The initial state of the agent X_0 is a fixed x_0. The agent acts according to a policy π, i.e. action A_t is sampled from the policy π_θ(·|X_t), where θ are the parameters of the policy, and aims to optimize the expected discounted return E[G] = E[Σ_t γ^t R_t]. The return G_t from step t is G_t = Σ_{t′≥t} γ^{t′−t} R_{t′}. Finally, we define the score function s_θ(π_θ, a, x) = ∇_θ log π_θ(a|x); the score function at time t is denoted S_t = ∇_θ log π_θ(A_t|X_t). In the case of a partially observed environment, we assume the agent receives an observation E_t at every time step, and simply define X_t to be the set of all previous observations, actions and rewards, X_t = (O_{≤t}), with O_t = (E_t, A_{t−1}, R_{t−1}). P(X) will denote the probability distribution of a random variable X.
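As a quick concrete reference for this notation, here is a minimal sketch (our own, not from the paper) of the standard backward recursion used to compute the discounted returns G_t defined above:

```python
# Illustrative sketch: computing G_t = sum_{t' >= t} gamma^(t'-t) * R_t'
# for a finite trajectory of rewards, via G_t = R_t + gamma * G_{t+1}.

def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for a list of rewards [R_0, R_1, ...]."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```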

2.2. POLICY GRADIENT ALGORITHMS

We begin by recalling two forms of policy gradient algorithms and the credit assignment assumptions they make. The first is the REINFORCE algorithm introduced by Williams (1992), which we will also call the single-action policy gradient estimator:

Proposition 1 (single-action estimator). The gradient of E[G] is given by ∇_θ E[G] = E[Σ_{t≥0} γ^t S_t (G_t − V(X_t))], where V(X_t) = E[G_t|X_t].

The appeal of this estimator lies in its simplicity and generality: to evaluate it, the only requirement is the ability to simulate trajectories and to compute both the score function and the return. Let us note two credit assignment features of the estimator. First, the score function S_t is multiplied not by the whole return G, but by the return G_t from time t. Intuitively, action A_t can only affect states and rewards coming after time t, and it is therefore pointless to credit action A_t with past rewards. Second, subtracting the value function V(X_t) from the return G_t does not bias the estimator and typically reduces variance. This estimator updates the policy through the score term; note however that the learning signal only updates the policy π_θ(a|X_t) at the value taken by action A_t = a (other values are only updated through normalization). The policy gradient theorem from Sutton et al. (2000), which we will also call the all-action policy gradient, shows it is possible to provide a learning signal to all actions, given access to a Q-function Q^π(x, a) = E[G_t|X_t = x, A_t = a], which we will call a critic in the following.

Proposition 2 (all-action policy gradient estimator). The gradient of E[G] is given by ∇_θ E[G] = E[Σ_t γ^t Σ_a ∇_θ π_θ(a|X_t) Q^{π_θ}(X_t, a)].

A particularity of the all-action policy gradient estimator is that the term ∇_θ π_θ(a|X_t) Q^{π_θ}(X_t, a) for updating the policy at time t depends only on past information; this is in contrast with the score function estimates above, which depend on the return, a function of the entire trajectory.
Proofs can be found in appendix D.1.
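To make Proposition 1 concrete, here is a deliberately tiny sketch (our own illustration, not code from the paper): a one-step, two-armed bandit with a sigmoid policy over a single logit. The reward scheme, baseline, and function names are all made up for illustration.

```python
import math
import random

# Single-action estimator grad = E[S_t * (G_t - V(X_t))] on a one-step,
# 2-armed bandit. The policy over actions {0, 1} is parameterized by a
# single logit theta, with pi(1) = sigmoid(theta).

def policy(theta):
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

def reinforce_grad(theta, rewards=(0.0, 1.0), baseline=0.5, n=20000, seed=0):
    rng = random.Random(seed)
    probs = policy(theta)
    grad = 0.0
    for _ in range(n):
        a = 0 if rng.random() < probs[0] else 1
        g = rewards[a]
        # score of a sigmoid policy w.r.t. the logit: d log pi(a)/d theta
        score = a - probs[1]
        grad += score * (g - baseline)
    return grad / n

# At theta = 0 with this reward scheme, every sampled term equals 0.25,
# so the estimate is exact: the gradient pushes towards the better arm.
```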

2.3. INTUITIVE EXAMPLE ON HINDSIGHT REASONING AND SKILL VERSUS LUCK

Imagine a scenario in which Alice just moved to a new city, is learning to play soccer, and goes to the local soccer field to play a friendly game with a group of other kids she has never met. As the game goes on, Alice does not seem to play at her best and makes some mistakes. It turns out however that her partner Megan is a strong player, and eventually scores the goal that makes the game a victory. What should Alice learn from this game? When using the single-action policy gradient estimate, the outcome of the game being a victory, and assuming a ±1 reward scheme, all her actions are made more likely; this is in spite of the fact that during this particular game she may not have played well and that the victory is actually due to her strong teammate. From an RL point of view, her actions are wrongly credited for the victory and positively reinforced as a result; effectively, Alice was lucky rather than skillful. Regular baselines do not mitigate this issue, as Alice did not a priori know the skill of Megan, resulting in a guess that she had a 50% chance of winning the game and a corresponding baseline of 0. This could be fixed by understanding that Megan's strong play was not a consequence of Alice's play, that her skill was a priori unknown but known in hindsight, and that it is therefore valid to retroactively include her skill level in the baseline. A hindsight baseline, conditioned on Megan's estimated skill level, would therefore be closer to 1, driving the advantage (and corresponding learning signal) close to 0. As pointed out by Buesing et al. (2019), situations in which hindsight information is helpful in understanding a trajectory are frequent. In that work, the authors adopt a model-based framework, where hindsight information is used to ground counterfactual trajectories (i.e. trajectories under different actions, but same randomness).
Our proposed approach follows a similar intuition, but is model-free: we attempt to measure, instead of model, information known in hindsight to compute a future-conditional baseline, with the constraint that the captured information must not have been caused by the agent.
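The soccer story can be turned into a minimal numeric sketch (our own construction, not from the paper): an exogenous binary 'teammate skill' decides the outcome, and conditioning the baseline on it in hindsight removes all of the advantage variance.

```python
import random

# Toy version of the example above: the return is decided by an exogenous
# factor ("Megan's skill"), independent of Alice's action. A baseline
# conditioned on that factor in hindsight drives the advantage, and hence
# its variance, to zero.

def sample(rng):
    skill = rng.choice([0, 1])          # exogenous factor, unknown a priori
    action = rng.choice([0, 1])         # Alice's action (irrelevant here)
    ret = 1.0 if skill == 1 else -1.0   # win/loss driven by the teammate
    return action, skill, ret

rng = random.Random(0)
data = [sample(rng) for _ in range(10000)]

v_forward = sum(r for _, _, r in data) / len(data)   # ordinary baseline, ~0
msq_forward = sum((r - v_forward) ** 2 for _, _, r in data) / len(data)

# Hindsight baseline E[G | skill]: valid since skill is independent of action.
rets = {s: [r for _, s2, r in data if s2 == s] for s in (0, 1)}
v_hind = {s: sum(v) / len(v) for s, v in rets.items()}
msq_hind = sum((r - v_hind[s]) ** 2 for _, s, r in data) / len(data)
```

Here `msq_forward` is close to 1 (the outcome looks like pure noise to the forward baseline) while `msq_hind` is exactly 0: once luck is measured in hindsight, nothing is left to wrongly credit the action with.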

2.4. FUTURE-CONDITIONAL POLICY GRADIENT ESTIMATOR (FC-PG)

Intuitively, our approach to assigning proper credit to action A_t is as follows: via learned statistics Φ_t we capture relevant information from the rest of the trajectory, e.g. including observations O_{t′} at times t′ greater than t. We then learn value functions which are conditioned on the additional hindsight information contained in Φ_t. In general, these future-conditional values and critics would be biased for use in a policy gradient algorithm; we therefore need to correct their impact on the policy gradient through an importance correction term.

Theorem 1 (Future single-action policy gradient estimator). Let Φ_t be an arbitrary random variable. The following is an unbiased estimator of the gradient of E[G]:

∇_θ E[G] = E[ Σ_t γ^t S_t ( G_t − (π_θ(A_t|X_t) / P^{π_θ}(A_t|X_t, Φ_t)) V(X_t, Φ_t) ) ]   (1)

where V(X_t, Φ_t) = E[G_t|X_t, Φ_t] is the future Φ-conditional value function, and P^{π_θ}(A_t|X_t, Φ_t) is the posterior probability of action A_t given (X_t, Φ_t), for trajectories generated by policy π_θ.

Theorem 2 (Future all-action policy gradient estimator). The following is an unbiased estimator of the gradient of E[G]:

∇_θ E[G] = E[ Σ_t γ^t Σ_a ∇_θ log π_θ(a|X_t) P^{π_θ}(a|X_t, Φ_t) Q^{π_θ}(X_t, Φ_t, a) ]

where Q^π(X_t, Φ_t, a) = E[G_t|X_t, Φ_t, A_t = a] is the future-conditional Q function (critic). Furthermore, we have Q^{π_θ}(X_t, a) = E[ Q^{π_θ}(X_t, Φ_t, a) P^{π_θ}(a|X_t, Φ_t)/π_θ(a|X_t) | X_t ].

Proofs can be found in appendix D.2. These estimators bear similarity to (and indeed generalize) the Hindsight Credit Assignment estimator (Harutyunyan et al., 2019); see the literature review and appendix C for a discussion of the connections.
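The importance correction in Theorem 1 can be checked exactly on a tiny discrete example by enumeration. The following sketch (our own, with made-up numbers) verifies that the corrected baseline term contributes zero to the expected gradient for an arbitrary V, even when Φ depends on the action:

```python
# Exact enumeration check that the importance-corrected future-conditional
# baseline of Theorem 1 has zero expected contribution:
# E[ S_t * (pi(A|X)/P(A|X,Phi)) * V(X,Phi) ] = 0 for any V.

pi = [0.3, 0.7]                       # policy over two actions (X fixed)
p_phi_given_a = [[0.9, 0.1],          # p(phi | a): Phi depends on the action
                 [0.2, 0.8]]
V = [5.0, -2.0]                       # arbitrary future-conditional values

# joint p(a, phi), marginal p(phi), and posterior P(a | phi)
p_joint = [[pi[a] * p_phi_given_a[a][f] for f in range(2)] for a in range(2)]
p_phi = [sum(p_joint[a][f] for a in range(2)) for f in range(2)]
post = [[p_joint[a][f] / p_phi[f] for f in range(2)] for a in range(2)]

def score(a, b):                      # d log pi(a) / d logit_b for a softmax
    return (1.0 if a == b else 0.0) - pi[b]

for b in range(2):                    # expected baseline term, per logit
    total = sum(p_joint[a][f] * score(a, b) * (pi[a] / post[a][f]) * V[f]
                for a in range(2) for f in range(2))
    assert abs(total) < 1e-12
```

The cancellation works because p(a, φ) · π(a)/P(a|φ) = p(φ) π(a), so the sum over actions reduces to Σ_a π(a) s(a), which is zero for any score function.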

2.5. COUNTERFACTUAL CREDIT ASSIGNMENT POLICY GRADIENT (CCA-PG)

The previous section provides a family of estimators, but does not specify which Φ should be used, nor what type of Φ would make the estimator useful. Instead of hand-crafting Φ, we will learn to extract Φ from the trajectory (the sequence of observations) (O_t)_{t≥0}. A useful representation Φ of the future will simultaneously satisfy two objectives:

• Φ_t is predictive of the outcome (the return); this is achieved by learning a Φ-conditional value function, through minimization of (G_t − V(X_t, Φ_t))² or (G_t − Q(X_t, Φ_t, a))².
• The statistic Φ_t is 'not a consequence' of action A_t; this is achieved by minimizing (with respect to Φ_t) a surrogate independence maximization (IM) loss L_IM, which is non-negative, and zero if and only if A_t and Φ_t are conditionally independent given X_t.

Intuitively, the statistics Φ capture factors exogenous to the agent (hence the conditional independence constraint) that still significantly affect the outcome (hence the return prediction loss). The IM constraint enables us to derive the CCA-PG estimator:

Theorem 3 (single-action CCA-PG estimator). If A_t is independent of Φ_t given X_t, the following is an unbiased estimator of the gradient of E[G]:

∇_θ E[G] = E[ Σ_t γ^t S_t (G_t − V(X_t, Φ_t)) ]   (3)

Furthermore, the hindsight advantage has no higher variance than the forward one: E[(G_t − V(X_t, Φ_t))²] ≤ E[(G_t − V(X_t))²].

Theorem 4 (all-action CCA-PG estimator). Under the same condition, the following is an unbiased estimator of the gradient of E[G]:

∇_θ E[G] = E[ Σ_t γ^t Σ_a ∇_θ π_θ(a|X_t) Q^{π_θ}(X_t, Φ_t, a) ]   (4)

Also, we have for all a, Q^{π_θ}(X_t, a) = E[Q^{π_θ}(X_t, Φ_t, a)|X_t, A_t = a]. Proofs can be found in appendix D.3.
The benefit of the first estimator (equation 3) is clear: under the specified condition, and compared to the regular policy gradient estimator, the CCA estimator is also unbiased, but the variance of its advantage G_t − V(X_t, Φ_t) (the critical component behind the variance of the overall estimator) is no higher. For the all-action estimator (equation 4), the benefits of CCA are less self-evident, since this estimator has higher variance than the regular all-action estimator (which has variance 0). The interest here lies in the bias due to learning imperfect Q functions. Both estimators require learning a Q function from data; any error in Q leads to a bias in the gradient for π. Learning Q(X_t, a) requires averaging over all possible trajectories initialized with state X_t and action a: in high-variance situations, this requires a lot of data. In contrast, if the agent can measure a quantity Φ_t which has a high impact on the return but is not correlated with the agent's action A_t, it can be far easier to learn Q(X_t, Φ_t, a). This is because Q(X_t, Φ_t, a) computes the average of the return G_t conditional on (X_t, Φ_t, a); if Φ_t has a high impact on G_t, the variance of that conditional return will be lower, and learning its average will in turn be simpler. Interestingly, note also that Q(X_t, Φ_t, a) (in contrast to Q(X_t, a)) is a trajectory-specific estimate of the return for a counterfactual action.
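To see why the conditional independence requirement in Theorems 3 and 4 cannot be dropped, consider a minimal enumeration (our own illustration, with made-up numbers): if Φ_t leaks the action, in the extreme case Φ_t = A_t, the hindsight baseline collapses the advantage and the expected update vanishes, even though one action is genuinely better.

```python
# Failure mode when Phi leaks the action (Phi = A): the hindsight value
# becomes Q(X, A), the advantage has zero conditional mean, and the
# expected update is exactly zero ("learned helplessness": every outcome
# is attributed to chance).

pi = [0.4, 0.6]          # policy over two actions
q = [1.0, 2.0]           # E[G | a]: action 1 is genuinely better

def score(a, b):         # d log pi(a) / d logit_b for a softmax policy
    return (1.0 if a == b else 0.0) - pi[b]

# unbiased single-action gradient (no baseline), per logit
true_grad = [sum(pi[a] * score(a, b) * q[a] for a in range(2))
             for b in range(2)]

# with a leaking statistic Phi = A, the baseline V(X, Phi) equals q[a],
# so every term of the expected learning signal is zero
leaky_grad = [sum(pi[a] * score(a, b) * (q[a] - q[a]) for a in range(2))
              for b in range(2)]

assert all(g == 0.0 for g in leaky_grad)
```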

2.6. ALGORITHMIC AND IMPLEMENTATION DETAILS

In this section, we provide one potential implementation of the CCA-PG estimator. Note however that, in order to be valid, the estimator only needs to satisfy the conditional independence assumption, and alternative strategies could be investigated. The agent is composed of four components: • Agent network: we assume the agent constructs an internal state X_t from (O_{t′})_{t′≤t} using an arbitrary network, for instance an RNN, i.e. X_t = RNN_θ(O_t, X_{t−1}). From X_t the agent computes a policy π_θ(a|X_t). • Hindsight network: additionally, the agent uses a hindsight network which computes a hindsight statistic Φ_t = ϕ_θ((O, X, A)), where (O, X, A) is the sequence of all observations, agent states and actions in the trajectory; Φ_t may depend arbitrarily on these vectors (in particular, it may depend on observations from timesteps t′ ≥ t). We investigated two architectures. The first is a backward RNN, where (Φ_t, B_t) = RNN_θ(X_t, B_{t+1}), with B_t the state of the backward RNN. Backward RNNs are justified in that they can extract information from arbitrary-length sequences and allow making the statistics Φ_t a function of the entire trajectory; they also have the inductive bias of focusing more on near-future observations. The second is a transformer (Vaswani et al., 2017; Parisotto et al., 2019). Alternative networks could be used, such as attention-based networks (Hung et al., 2019) or RIMs (Goyal et al., 2019). • Value network: the third component is a future-conditional value network V_θ(X_t, Φ_t). • Hindsight classifier: the last component is a probabilistic classifier h_ω with parameters ω that takes X_t, Φ_t as input and outputs a distribution over A_t.
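The wiring of the forward agent state and the backward hindsight statistic can be sketched in a few lines. This is a toy sketch (ours), with scalar states and made-up linear 'networks' standing in for the RNNs:

```python
# Forward recurrence builds X_t from past observations; a backward
# recurrence builds Phi_t, which can see observations from t' >= t.

def forward_states(observations, f):
    xs, x = [], 0.0
    for o in observations:
        x = f(o, x)                    # X_t = RNN(O_t, X_{t-1})
        xs.append(x)
    return xs

def hindsight_stats(observations, b):
    phis = [0.0] * len(observations)
    bstate = 0.0
    for t in reversed(range(len(observations))):
        bstate = b(observations[t], bstate)   # backward recurrence
        phis[t] = bstate                      # Phi_t depends on O_{>=t}
    return phis

obs = [1.0, 2.0, 3.0]
xs = forward_states(obs, lambda o, x: 0.5 * x + o)     # toy "network"
phis = hindsight_stats(obs, lambda o, bs: 0.5 * bs + o)
```

The point of the sketch is only the information flow: `xs[t]` depends on observations up to t, while `phis[t]` depends on observations from t onwards.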
Learning is ensured through the minimization of four losses: the hindsight baseline loss L_hs = Σ_t (G_t − V_θ(X_t, Φ_t))² (optimized with respect to θ); the hindsight classifier loss L_sup = −Σ_t log h_ω(A_t|X_t, Φ_t) (optimized with respect to ω only; all other parameters are treated as constants); the policy gradient surrogate loss L_PG = −Σ_t log π_θ(A_t|X_t)(G_t − V(X_t, Φ_t)), where the advantage term is treated as a constant from the point of view of gradient computation; and finally the aforementioned independence loss L_IM, which ensures the conditional independence between A_t and Φ_t. We investigated two IM losses. The first is the Kullback-Leibler divergence between the distributions P^{π_θ}(A_t|X_t) and P^{π_θ}(A_t|X_t, Φ_t). In this case, the KL can be estimated by Σ_a P^{π_θ}(a|X_t)(log P^{π_θ}(a|X_t) − log P^{π_θ}(a|X_t, Φ_t)); P^{π_θ}(a|X_t) is simply the policy π_θ(a|X_t), and the posterior P^{π_θ}(a|X_t, Φ_t) can be approximated by the probabilistic classifier h_ω(a|X_t, Φ_t). This results in L_IM(t) = Σ_a π_θ(a|X_t)(log π_θ(a|X_t) − log h_ω(a|X_t, Φ_t)). We also investigated the conditional mutual information between A_t and Φ_t, again approximated using h. We did not see significant differences between the two, with the KL slightly outperforming the mutual information. Finally, note that, conversely to the classifier loss, when optimizing the IM loss ω is treated as a constant. Parameter updates and a figure depicting the architecture can be found in Appendix A.
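The KL-based IM loss is straightforward to implement once the policy and classifier both output action distributions. A minimal sketch (ours), where `policy_probs` and `classifier_probs` stand in for π_θ(·|X_t) and h_ω(·|X_t, Φ_t):

```python
import math

# L_IM(t) = sum_a pi(a|X_t) * (log pi(a|X_t) - log h(a|X_t, Phi_t)),
# a KL divergence: zero exactly when the classifier's posterior matches
# the policy, i.e. when Phi_t carries no information about A_t given X_t.

def l_im(policy_probs, classifier_probs):
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(policy_probs, classifier_probs) if p > 0.0)
```

When the posterior equals the policy, the loss is exactly zero; any mismatch (i.e. any action information leaking into Φ) makes it strictly positive.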

3. NUMERICAL EXPERIMENTS

Given its guarantees on lower variance and unbiasedness, we run all our experiments on the single-action version of CCA-PG.

3.1. BANDIT WITH FEEDBACK

We first demonstrate the benefits of hindsight value functions in a toy problem designed to highlight them. We consider a contextual bandit problem with feedback. Given N, K ∈ N, we sample for each episode an integer context C with −N ≤ C ≤ N, as well as an exogenous noise ε_r ∼ N(0, σ_r). Upon taking action A ∈ {−N, . . . , N}, the agent receives a reward R = −(C − A)² + ε_r. Additionally, the agent is provided with a K-dimensional feedback vector F = U_C + V_A + W ε_r, where U_n, V_n ∈ R^K for −N ≤ n ≤ N, and W ∈ R^K, are fixed vectors; in our case, for each seed, they are sampled from a standard Gaussian distribution and kept constant through all episodes. More details about this problem, as well as variants, are presented in Appendix B.1. For this problem, the optimal policy is to choose A = C, resulting in an average reward of 0. However, the reward R is the sum of the informative reward −(C − A)² and the noisy reward ε_r, which is uncorrelated with the action. The higher the standard deviation σ_r, the more difficult it is to perform proper credit assignment, as high rewards are more likely due to a high value of ε_r than to an appropriate choice of action. On the other hand, the feedback F contains information about C, A and ε_r. If the agent can extract from F a statistic Φ capturing information about ε_r and use it to compute a hindsight value function, the effect of the perturbation ε_r may be removed from the advantage, resulting in a significantly lower-variance estimator. However, if the agent blindly uses F to compute the hindsight value, information about the context and action will 'leak' into the hindsight value, leading to an advantage of 0 and no learning: intuitively, the agent will assume the outcome is entirely controlled by chance, and that all actions are equivalent, resulting in a form of learned helplessness. We investigate the proposed algorithm with N = 10, K = 64. As can be seen in Fig.
1, increasing the variance of the exogenous noise leads to a dramatic decrease in performance for the vanilla PG estimator without the hindsight baseline; in contrast, the CCA-PG estimator is generally unaffected by the exogenous noise. For very low levels of exogenous noise, however, CCA-PG suffers a decrease in performance. This is due to the agent computing a hindsight statistic Φ which is not perfectly independent of A, leading to bias in the policy gradient update. The agent attributes part of the reward to chance, despite the fact that in the low-noise regime the outcome is entirely due to the agent's actions. To demonstrate this, and to evaluate the impact of the independence constraint on performance, we run CCA-PG with different values of the weight λ_IM of the independence maximization loss, as seen in Fig. 1. For lower values of this parameter, i.e. when Φ and A have a larger mutual information, performance is dramatically degraded.

3.2. KEY-TO-DOOR ENVIRONMENTS

Task Description. We introduce the Key-To-Door family of environments as a testbed of tasks where credit assignment is hard and necessary for success. In this environment (cf. Fig. 2), the agent has to pick up a key in the first room, for which it receives no immediate reward. In the second room, the agent can pick up 10 apples, each giving an immediate reward. In the final room, the agent can open a door (only if it picked up the key in the first room) and receive a small reward. In this task, a single action (picking up the key) has a very small impact on the reward received in the final room, while the episode return is largely driven by performance in the second room (picking up apples).

We now consider two instances of the Key-To-Door family that illustrate the difficulty of credit assignment in the presence of extrinsic variance. In the Low-Variance-Key-To-Door environment, each apple is worth a reward of 1 and opening the final door also gives a reward of 1. Thus, an agent that solves the apple phase perfectly sees very little variance in its episode return, and the learning signal for picking up the key and opening the door is relatively strong. High-Variance-Key-To-Door keeps the overall structure of the Key-To-Door task, but the reward for each apple is now randomly sampled to be either 1 or 10, and fixed within the episode. In this setting, even an agent with a perfect apple-phase policy sees a large variance in episode returns, and the learning signal for picking up the key and opening the door is comparatively weaker. Appendix B.2.1 contains additional discussion illustrating the difficulty of learning in such a setting.
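A back-of-envelope simulation (our own, using the reward numbers above) illustrates the difficulty: even with a perfect apple phase, the apple lottery in High-Variance-Key-To-Door makes the return variance orders of magnitude larger than the +1 door signal, while the low-variance variant has no return variance at all.

```python
import random

# With 10 apples each worth 1 (low-variance) or worth a per-episode sample
# from {1, 10} (high-variance), the door's +1 reward is swamped by the
# apple lottery in the high-variance case.

rng = random.Random(0)

def episode_return(picked_key, apple_value):
    apples = 10 * apple_value           # perfect apple phase: all 10 apples
    door = 1 if picked_key else 0
    return apples + door

low = [episode_return(True, 1) for _ in range(10000)]             # always 11
high = [episode_return(True, rng.choice([1, 10])) for _ in range(10000)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# variance(low) is 0; variance(high) is on the order of 2000, i.e. three
# orders of magnitude larger than the +1 key/door learning signal
```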

Results

We test CCA-PG on our environments and compare it against actor-critic (Williams, 1992), as well as State-conditional HCA and Return-conditional HCA (Harutyunyan et al., 2019), as baselines. We test using both a backward LSTM (referred to as CCA-PG RNN) and an attention model (referred to as CCA-PG Attn) for the hindsight function. Details of the experimental setup are provided in Appendix B.2.2. All results are reported as median performances over 10 seeds. We evaluate agents both on their ability to maximize total reward and on their ability to solve the specific credit assignment problem of picking up the key and opening the door. Figure 3 compares CCA-PG with the baselines on the High-Variance-Key-To-Door task. Both CCA-PG architectures outperform the baselines in terms of total reward, as well as probability of picking up the key and opening the door. This example highlights the capacity of CCA-PG to learn and incorporate trajectory-specific external factors into its baseline, resulting in lower-variance estimators. Despite this being a difficult task for credit assignment, CCA-PG solves it quickly and consistently. On the other hand, vanilla actor-critic is greatly impacted by this external variance, and needs around 3 × 10^9 environment steps to reach an 80% probability of opening the door. CCA-PG also outperforms State- and Return-conditional HCA, which do use hindsight information, but in a more limited way than CCA-PG. On the Low-Variance-Key-To-Door task, due to the lack of extrinsic variance, standard actor-critic is able to perfectly solve the environment. However, it is interesting to note that CCA-PG still matches this perfect performance. The other hindsight methods, on the other hand, struggle with both door-opening and apple-gathering.
This might be explained by the fact that both these techniques do not guarantee lower variance, and rely strongly on their learned hindsight classifiers for their policy gradient estimators, which can be harmful when these quantities are not perfectly learned. See Appendix B.2.3 for additional experiments and ablations on these environments. These experiments demonstrate that CCA-PG is capable of efficiently leveraging hindsight information to mitigate the challenge of external variance and learn strong policies that outperform baselines. At the same time, it suffers no drop in performance when used in cases where external variance is minimal.

3.3. TASK INTERLEAVING

Motivation. In the real world, human activity can be seen as solving a large number of loosely related problems. These problems are not solved sequentially: one may temporarily engage with a problem, and only continue engaging with it, or receive feedback from earlier actions, significantly later. At an abstract level, one could see this lifelong learning process as solving problems not in a sequential, but in an interleaved fashion. The structure of this interleaving will also typically vary over time. Despite this very complex structure, and despite receiving high-variance rewards from the future, humans are able to quickly make sense of these varying episodes and correctly credit their actions. This learning paradigm is quite different from what is usually considered in reinforcement learning, where the focus is mostly on agents trained on a single task, with an outcome dominated by the agent's actions, where long-term credit assignment is not required, and where every episode is structurally the same. To understand the effects of this interleaving on lifelong learning, we introduce a new class of environments capturing the structural properties mentioned above. In contrast to most work on multi-task learning, we do not assume a clear delineation between subtasks: each agent will encounter multiple tasks in a single episode, and it is the agent's responsibility to implicitly detect the boundaries between them. Task Description. As described in Fig. 4, this task consists of interleaved pairs of query-answer rooms with different visual contexts representing different tasks. Each task has an associated mapping of 'good' (resp. 'bad') colors yielding high (resp. zero) reward. Each episode is composed of randomly sampled tasks and color pairs within those tasks. The ordering and composition of each episode is random across tasks and color pairs. A visual example of an episode can be seen in Fig. 4.
Additional details are provided in Appendix B.3.1. The 6 tasks we consider next (numbered #1 to #6) are respectively associated with rewards of 80, 4, 100, 6, 2 and 10. Tasks #2, #4, #5 and #6 are referred to as 'hard', and tasks #1 and #3 as 'easy' because of their large associated rewards. The 2-, 4- and 6-task settings respectively comprise tasks #1-#2, #1-#4 and #1-#6. In addition to the total reward, we record the probability of picking up the correct square for the easy and hard tasks separately. Performance on the hard tasks indicates the ability to do fine-grained credit assignment. Results. While CCA-PG is able to perfectly solve both the 'easy' and 'hard' tasks in all three setups in less than 5 × 10^8 environment steps (Fig. 5), actor-critic is only capable of solving the 'easy' tasks, for which the associated rewards are large. Even after 2 × 10^9 environment steps, actor-critic is still greatly impacted by the variance and remains incapable of solving the 'hard' tasks in any of the three settings. CCA-PG also outperforms actor-critic in terms of the total reward obtained in each setting. State-conditional and Return-conditional HCA were also evaluated on this task, but results are not reported as almost no learning was taking place on the 'hard' tasks. Details of the experimental setup are provided in Appendix B.3.2. All results are reported as median performances over 10 seeds. More results, along with an ablation study, can be found in Appendix B.3.3. Through efficient use of hindsight, CCA-PG is able to take into account trajectory-specific factors such as the kinds of rooms encountered in the episode and their associated rewards. In the case of the Multi-Task Interleaving environment, an informative hindsight function would capture the reward for different contexts and expose as Φ_t all rewards obtained in the episode except those associated with the current context.
This experiment again highlights the capacity of CCA-PG to solve hard credit assignment problems in a context where the return is affected by multiple distractors, while PG remains highly sensitive to them.

4. RELATED WORK

This paper builds on work from Buesing et al. (2019), which shows how causal models and real data can be combined to generate counterfactual trajectories and perform off-policy evaluation for RL. Their results however require an explicit model of the environment. In contrast, our work proposes a model-free approach and focuses on policy improvement. Oberst & Sontag (2019) likewise investigate counterfactual reasoning in RL with structural causal models, and also require a model. Guez et al. (2019) also investigate future-conditional value functions; similar to us, they learn statistics of the future Φ from which returns can be accurately predicted, and show that doing so leads to learning better representations (but they use regular policy gradient estimators otherwise). Instead of enforcing an information-theoretic constraint, they bottleneck information through the size of the encoding Φ. In domain adaptation (Ganin et al., 2016; Tzeng et al., 2017), robustness to the training domain can be achieved by constraining the agent's representation not to be able to discriminate between source and target domains, a mechanism similar to the one constraining hindsight features not to be able to discriminate the agent's actions. Both Andrychowicz et al. (2017) and Rauber et al. (2017) leverage the idea of using hindsight information to learn goal-conditioned policies. Hung et al. (2019) leverage attention-based systems and episodic memory to perform long-term credit assignment; however, their estimator will in general be biased. Ferret et al. (2019) study transfer learning in RL and leverage transformers to derive a heuristic for reward shaping. Arjona-Medina et al. (2019) also address the problem of long-term credit assignment by redistributing delayed rewards earlier in the episode; their approach still fundamentally uses time as a proxy for credit. Previous research also leverages the fact that baselines can include information unknown to the agent at time t (but potentially revealed in hindsight) that is not affected by action A_t; see e.g.
(Wu et al., 2018; Foerster et al., 2018; Andrychowicz et al., 2020; Vinyals et al., 2019). Note however that all of these require privileged information, both in the form of feeding the baseline information inaccessible to the agent, and in knowing that this information is independent of the agent's action A_t and therefore will not bias the baseline. Our approach seeks to replicate a similar effect, but in a more general fashion and from an agent-centric point of view, where the agent itself learns which information from the future can be used to augment its baseline at time t.

5. CONCLUSION

In this paper we have considered the problem of credit assignment in RL. Building on insights from causality theory and structural causal models, we have developed the concept of future-conditional value functions. Contrary to common practice, these allow baselines and critics to condition on future events, separating the influence of an agent's actions on future rewards from the effects of other random events, and thereby reducing the variance of policy gradient estimates. A key difficulty lies in the fact that unbiasedness relies on accurate estimation and minimization of mutual information: learning inaccurate hindsight classifiers will result in miscalibrated estimates of luck, leading to biased learning. Future research will investigate how to scale these algorithms to more complex environments, and the benefits of the more general FC-PG and all-actions estimators.

A ARCHITECTURE

Parameter updates. For each trajectory $(X_t, A_t, R_t)_{t \ge 0}$, compute the parameter updates:
• $\Delta\theta = -\lambda_{PG} \sum_t \nabla_\theta \log \pi_\theta(A_t|X_t)(G_t - V(X_t, \Phi_t)) - \lambda_{hs} \nabla_\theta \sum_t L_{hs}(t) - \lambda_{IM} \nabla_\theta \sum_t L_{IM}(t)$
• $\Delta\omega = -\nabla_\omega \sum_t L_{sup}(t)$
where the different λ are the weights of each loss.
[Architecture diagram: observations $O_t$ feed forward into agent states $X_t$, which produce the policies $\pi_t$ and actions $A_t$; a backward pass computes $B_t$ and the hindsight features $\Phi_t$, which feed the values $V_t$.]
For the bandit problems, the agent architecture is as follows:
• The hindsight feature Φ is computed by a backward RNN. We tried multiple cores for the RNN: a GRU (Chung et al., 2015) with 32 hidden units, a recurrent adder ($b_t = b_{t-1} + \mathrm{MLP}(x_t)$, where the MLP has two layers of 32 units), or an exponential averager ($b_t = \lambda b_{t-1} + (1-\lambda)\mathrm{MLP}(x_t)$).
• The hindsight classifier $h_\omega$ is a simple MLP with two hidden layers of 32 units each.
• The policy and value functions are computed as the output of a simple linear layer with the concatenated observation and feedback as input.
• All weights are jointly trained with Adam (Kingma & Ba, 2014).
• Hyperparameters are chosen as follows (unless specified otherwise): learning rate 4e-4, entropy loss 4e-3, independence-maximization tolerance $\beta_{IM} = 0.1$; $\lambda_{fwd} = \lambda_{hs} = 1$; $\lambda_{IM}$ is set through Lagrangian optimization (GECO, Rezende & Viola (2018)).
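To make the update above concrete, here is a minimal numpy sketch of how the pieces combine into the two parameter updates. All function and argument names are our own placeholders (the gradients of the auxiliary losses are assumed precomputed), not the paper's actual implementation:

```python
import numpy as np

def cca_pg_update(scores, returns, hindsight_values,
                  grad_hs, grad_im, grad_sup,
                  lam_pg=1.0, lam_hs=1.0, lam_im=1.0):
    """Sketch of the combined CCA-PG parameter updates.

    scores:           per-step score gradients d/dtheta log pi(A_t|X_t), shape (T, P)
    returns:          per-step returns G_t, shape (T,)
    hindsight_values: hindsight baselines V(X_t, Phi_t), shape (T,)
    grad_hs, grad_im: (assumed precomputed) gradients of the summed
                      hindsight-baseline and IM losses w.r.t. theta, shape (P,)
    grad_sup:         gradient of the summed classifier loss w.r.t. omega, shape (Q,)
    """
    # Hindsight advantage: G_t - V(X_t, Phi_t)
    advantages = returns - hindsight_values
    # Delta theta combines the policy-gradient term with the auxiliary losses.
    delta_theta = -(lam_pg * (scores * advantages[:, None]).sum(axis=0)
                    + lam_hs * grad_hs + lam_im * grad_im)
    # Delta omega only trains the hindsight classifier.
    delta_omega = -grad_sup
    return delta_theta, delta_omega
```

In an actual agent, `grad_hs`, `grad_im`, and `grad_sup` would come from automatic differentiation of the respective losses; here they are opaque vectors to keep the sketch self-contained.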

B.1.2 ADDITIONAL RESULTS

Multiagent Bandit Problem: In the multi-agent version, which we will call MULTI-BANDIT, the environment is composed of M replicas of the bandit-with-feedback task. Each agent i = 1, ..., M interacts with its own version of the environment, but feedback and rewards are coupled across agents; MULTI-BANDIT is obtained by modifying the single-agent version as follows:
• The contexts $C^i$ are sampled i.i.d. from $\{-N, \ldots, N\}$. C and A now denote the concatenation of all agents' contexts and actions.
• The feedback tensor is (M, K)-dimensional, and is computed as $W_c \mathbb{1}(C) + W_a \mathbb{1}(A) + f$, where the W are now three-dimensional tensors. Effectively, the feedback for agent i depends on the contexts and actions of all other agents.
• The observation for agent i at step $t \ge 1$ is $(0, F^i[t])$, where $F^i[t] = F^i_{(t-1)B+1:tB}$.
• The terminal joint reward is $\sum_i -(C^i - A^i_0)^2$ for all agents.
The multi-agent version does not require the exogenous noise e, as other agents play the role of exogenous noise; it is a minimal implementation of the example found in section 2.3. Finally, we report results from the MULTI-BANDIT version of the environment, which can be found in Fig. 7. As the number of interacting agents increases, the effective variance of the vanilla PG estimator increases as well, and the performance of each agent decreases. In contrast, CCA-PG agents learn faster and reach higher performance (though they never learn the optimal policy). Table 1 shows the advantages of either picking up the key or not, for an agent that has a perfect apple-phase policy but never picks up the key or door, on High-Variance-Key-To-Door. Since there are 10 apples which can each be worth 1 or 10, the return will be either 10 or 100. Thus the forward baseline in the key phase, i.e. before the agent has seen how much an apple is worth in the current episode, will be 55.
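The coupled feedback computation above can be sketched in a few lines of numpy. The shapes and the shift of contexts into a non-negative index range are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Sketch of the MULTI-BANDIT feedback computation: M agents, K feedback
# dimensions, N possible contexts/actions per agent (illustrative sizes).
rng = np.random.default_rng(0)
M, K, N = 3, 4, 5

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

# Three-dimensional tensors W_c, W_a couple every agent's feedback to the
# concatenated one-hot encodings of ALL agents' contexts and actions.
W_c = rng.normal(size=(M, K, M * N))
W_a = rng.normal(size=(M, K, M * N))
f = rng.normal(size=(M, K))

# Contexts are shifted to {0..N-1} here for indexing convenience
# (the text samples them from {-N..N}).
contexts = rng.integers(0, N, size=M)
actions = rng.integers(0, N, size=M)

c_vec = np.concatenate([one_hot(c, N) for c in contexts])  # 1(C)
a_vec = np.concatenate([one_hot(a, N) for a in actions])   # 1(A)

# Feedback for agent i depends on the contexts and actions of all agents.
F = W_c @ c_vec + W_a @ a_vec + f   # shape (M, K)

# Terminal joint reward: sum_i -(C_i - A_i)^2
reward = -np.sum((contexts - actions) ** 2)
```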
As seen here, without hindsight the difference in advantages due to luck is far larger than the difference in advantages due to skill, making learning difficult and leading to a policy that never learns to pick up the key or door. However, when we use a hindsight-conditioned baseline, we are able to learn a Φ (i.e. the value of a single apple in the current episode) that is completely independent of the actions taken by the agent, but which provides a perfect hindsight-conditioned baseline of either 10 or 100.
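The luck-dominated advantage values discussed above follow from a small calculation (10 apples, each worth 1 or 10 in a given episode, key/door reward ignored for simplicity):

```python
# Worked example of the advantage computation for High-Variance-Key-To-Door.
apple_values = [1, 10]
returns = [10 * v for v in apple_values]        # episode return: 10 or 100
forward_baseline = sum(returns) / len(returns)  # 55, cannot see the apple value

# Without hindsight, the advantage is dominated by luck: G_t - 55.
adv_no_hindsight = [g - forward_baseline for g in returns]

# With hindsight, Phi reveals the per-episode apple value, so the baseline
# matches the expected return and the luck term vanishes.
hindsight_baselines = returns                   # 10 or 100
adv_hindsight = [g - b for g, b in zip(returns, hindsight_baselines)]
```

Here `adv_no_hindsight` evaluates to `[-45.0, 45.0]`, matching the luck terms described in the text, while `adv_hindsight` is zero in both cases.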

(Table 1 values: without hindsight, the advantage of picking up the key takes values near +45 or -45 depending on the episode's apple value.)

Table 1 : The advantage of the action of picking up a key in High-Variance-Key-To-Door, as computed by an agent that always picks up every apple, and never picks up the key or door. We see that an advantage learned using hindsight clearly differentiates between the skillful and unskillful actions; whereas for an advantage learned without using hindsight, this difference is dominated by the extrinsic randomness.

B.2.2 ARCHITECTURE

The agent architecture is as follows:
• The observations are first fed to a 2-layer CNN with (16, 32) output channels, kernel shapes of (3, 3) and strides of (1, 1). The output of the CNN is then flattened and fed to a linear layer of size 128.
• The agent state is computed by a forward LSTM with a state size of 128. The input to the LSTM is the output of the previous linear layer, concatenated with the reward at the previous timestep.
• The hindsight feature Φ is computed either by a backward LSTM (i.e. CCA-PG RNN) with a state size of 128, or by an attention mechanism (Vaswani et al., 2017) (i.e. CCA-PG Att) with value and key sizes of 64, 1 transformer block with 2 attention heads and a 1-hidden-layer MLP of size 1024, an output size of 128, and a dropout rate of 0.1. The input provided is the concatenation of the output of the forward LSTM and the reward at the previous timestep.
• The policy is computed as the output of a simple MLP with one layer of 64 units, with the output of the forward LSTM as input.
• The forward baseline is computed as the output of a 3-layer MLP of 128 units each, with the output of the forward LSTM as input.
• For CCA, the hindsight classifier $h_\omega$ is computed as the concatenation of the output of an MLP with four hidden layers of 256 units each (whose input is the concatenation of the output of the forward LSTM and the hindsight feature Φ) and the log of the policy outputs.
• For State HCA, the hindsight classifier $h_\omega$ is computed as the output of an MLP with four hidden layers of 256 units each, whose input is the concatenation of the outputs of the forward LSTM at two given time steps.
• For Return HCA, the hindsight classifier $h_\omega$ is computed as the output of an MLP with four hidden layers of 256 units each, whose input is the concatenation of the output of the forward LSTM and the return.
• The hindsight baseline is computed as the output of a 3-layer MLP of 128 units each, whose input is the concatenation of the output of the forward LSTM and the hindsight feature Φ. The hindsight baseline is trained to learn the residual between the return and the forward baseline.
• All weights are jointly trained with RMSprop (Hinton et al., 2012).
For High-Variance-Key-To-Door, the optimal hyperparameters found and used for each algorithm can be found in Table 2. For Key-To-Door, the optimal hyperparameters found and used for each algorithm can be found in Table 3. The agents are trained on full-episode trajectories, using a discount factor of 0.99. As shown in Fig. 8, in the case of actor-critic, the baseline loss increases at first: as the reward associated with apples varies from one episode to another, getting more apples also means increasing the forward baseline loss. On the other hand, as CCA is able to take into account trajectory-specific exogenous factors, the hindsight baseline loss decreases nicely as learning takes place. Fig. 9 shows the impact of the variance level induced by the apple-reward discrepancy between episodes on the probability of picking up the key and opening the door. Thanks to the use of hindsight in its value function, CCA-PG is barely affected, whereas actor-critic sees its performance drop dramatically as variance increases. Figure 10 shows a qualitative analysis of the attention weights learned by CCA-PG Att on the High-Variance-Key-To-Door task. For this experiment, we use only a single attention head for easier interpretation of the hindsight function, and show both a heatmap of the attention weights over the entire episode, and a histogram of attention weights at the step where the agent picks up the key. As expected, the most attention is paid to the timesteps just after the agent picks up an apple, since these are the points at which the apple reward is provided to the Φ computation.
In particular, very little attention is paid to the timestep where the agent opens the door. These insights further show that the learned hindsight function is highly predictive of the episode return while carrying no information about the actions taken by the agent, thus ensuring an unbiased policy gradient estimator.
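The independence requirement above can be illustrated numerically. This is a toy sketch (made-up numbers, not the paper's exact $L_{IM}$ loss) of the quantity the independence-maximization constraint controls: the KL divergence between the hindsight classifier's action distribution $h(\cdot|X_t, \Phi_t)$ and the policy $\pi(\cdot|X_t)$, which vanishes exactly when $\Phi_t$ carries no information about $A_t$:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Policy over two actions.
pi = np.array([0.25, 0.75])

# Case 1: Phi is uninformative about A_t, so the classifier's best guess
# is just the policy itself -> zero penalty, unbiased baseline.
h_independent = np.array([0.25, 0.75])

# Case 2: Phi leaks the action, so the classifier deviates from the
# policy -> positive penalty, the hindsight baseline would be biased.
h_leaky = np.array([0.9, 0.1])

penalty_ok = kl(h_independent, pi)
penalty_bad = kl(h_leaky, pi)
```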

B.3 MULTI TASKS INTERLEAVING

B.3.1 ENVIRONMENT DETAILS

For each task, a random (but fixed throughout training) set of 5 out of 10 colored squares leads to a positive reward. Furthermore, a small reward of 0.5 is provided to the agent when it picks up any colored square. Each episode is 140 steps long, and it takes the agent 9 steps to reach a colored square from its initial position.

B.3.2 ARCHITECTURE

We use the same architecture setup as reported in Appendix B.2.2. The agents are also trained on full-episode trajectories, using a discount factor of 0.99. For Multi Tasks Interleaving, the optimal hyperparameters found and used for each algorithm can be found in Table 4. As explained in Section 3.3, CCA is able to solve all 6 tasks quickly despite the variance induced by the exogenous factors. Actor-critic, on the other hand, despite solving the easy tasks 1 and 3 for which the agent receives a large reward, is incapable of reliably solving the 4 remaining tasks for which the associated reward is smaller. This helps unpack Fig. 5.

B.3.4 ABLATION STUDY

Fig. 12 shows the impact of the number of backpropagation-through-time steps performed in the backward RNN of the hindsight function while performing full rollouts (it reports the probability of picking up the correct squares for the hard and easy tasks in the 6-task setup of Multi Task Interleaving). It shows that learning in hard tasks, where hindsight is crucial for performance, is not much impacted by the number of backpropagation steps performed in the backward RNN. This is good news, as it indicates that learning in challenging credit assignment tasks can happen even when the hindsight function sees the whole future but can only backpropagate through a limited window. Fig. 13 shows how the performance of CCA with an RNN for the hindsight function is affected by the unroll length. As expected, the less the agent is able to look into the future, the harder it becomes to solve this hard credit assignment task, as it becomes limited in its capacity to take exogenous effects into account. The two previous results are encouraging because, to work at its fullest, CCA seems only to require access to as many steps into the future as possible, while not needing to backpropagate through the full sequence. This observation is particularly useful as the environments considered become more complex, with longer episodes.

C RELATION BETWEEN HCA, CCA, AND FC ESTIMATORS

The FC estimators generalize both the HCA and CCA estimators. From FC, we can derive CCA by assuming that $\Phi_t$ and $A_t$ are conditionally independent (see next section). We can also derive state and return HCA from FC. For return HCA, we obtain both an all-actions and a baseline version of return HCA by choosing $\Phi_t = G_t$. For state HCA, we first need to decompose the return into a sum of rewards, and apply the policy gradient estimator to each reward separately. For a pair $(X_t, R_{t+k})$, and assuming for simplicity that $R_{t+k}$ is a function of $X_{t+k}$, we choose $\Phi_t = X_{t+k}$. We then sum the FC estimators for different values of k and obtain both an all-actions and a single-action version of state HCA. Note however that HCA and CCA cannot be derived from one another: the two estimators take different approaches to unbiasedness, one (HCA) leveraging importance sampling, and the other (CCA) eschewing importance sampling in favor of constraint satisfaction (in the context of inference, this is similar to the difference between obtaining posterior samples by importance sampling versus directly parametrizing the posterior distribution).

D PROOFS

D.1 POLICY GRADIENTS

Proof of Proposition 1. By linearity of expectation, the expected return can be written as $\mathbb{E}[G] = \sum_t \gamma^t \mathbb{E}[R_t]$. Writing the expectation as an integral over trajectories, we have:
$$\mathbb{E}[R_t] = \sum_{x_0,\ldots,x_t}\;\sum_{a_0,\ldots,a_t} \Big(\prod_{s\le t} \pi_\theta(a_s|x_s)\,P(x_{s+1}|x_s,a_s)\Big)\,R(x_t,a_t).$$
Taking the gradient with respect to θ:
$$\nabla_\theta \mathbb{E}[R_t] = \sum_{x_0,\ldots,x_t}\;\sum_{a_0,\ldots,a_t} \Big(\sum_{s'\le t} \nabla_\theta \pi_\theta(a_{s'}|x_{s'})\,P(x_{s'+1}|x_{s'},a_{s'}) \prod_{s\le t,\,s\ne s'} \pi_\theta(a_s|x_s)\,P(x_{s+1}|x_s,a_s)\Big)\,R(x_t,a_t).$$
We then rewrite $\nabla_\theta \pi_\theta(a_{s'}|x_{s'}) = \pi_\theta(a_{s'}|x_{s'})\,\nabla_\theta \log \pi_\theta(a_{s'}|x_{s'})$, and obtain
$$\nabla_\theta \mathbb{E}[R_t] = \mathbb{E}\Big[\sum_{s'\le t} \nabla_\theta \log \pi_\theta(A_{s'}|X_{s'})\,R_t\Big].$$
Summing over t, we obtain
$$\nabla_\theta \mathbb{E}[G] = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^t \sum_{s\le t} \nabla_\theta \log \pi_\theta(A_s|X_s)\,R_t\Big],$$
which can be rewritten (with a change of variables):
$$\nabla_\theta \mathbb{E}[G] = \mathbb{E}\Big[\sum_{t\ge 0} \nabla_\theta \log \pi_\theta(A_t|X_t) \sum_{t'\ge t} \gamma^{t'} R_{t'}\Big] = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^t \nabla_\theta \log \pi_\theta(A_t|X_t) \sum_{t'\ge t} \gamma^{t'-t} R_{t'}\Big] = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^t S_t G_t\Big].$$
To complete the proof, we need to show that $\mathbb{E}[S_t V(X_t)] = 0$. By iterated expectation, $\mathbb{E}[S_t V(X_t)] = \mathbb{E}[\mathbb{E}[S_t V(X_t)|X_t]] = \mathbb{E}[V(X_t)\,\mathbb{E}[S_t|X_t]]$, and we have $\mathbb{E}[S_t|X_t] = \sum_a \nabla_\theta \pi_\theta(a|X_t) = \nabla_\theta \big(\sum_a \pi_\theta(a|X_t)\big) = \nabla_\theta 1 = 0$.

Proof of Proposition 2. We start from the single-action policy gradient $\nabla_\theta \mathbb{E}[G] = \mathbb{E}\big[\sum_{t\ge 0} \gamma^t S_t G_t\big]$ and analyse the term for time t, $\mathbb{E}[S_t G_t]$:
$$\mathbb{E}[S_t G_t] = \mathbb{E}[\mathbb{E}[S_t G_t|X_t,A_t]] = \mathbb{E}[S_t\,\mathbb{E}[G_t|X_t,A_t]] = \mathbb{E}[S_t\,Q(X_t,A_t)] = \mathbb{E}[\mathbb{E}[S_t\,Q(X_t,A_t)|X_t]] = \mathbb{E}\Big[\sum_a \nabla_\theta \pi_\theta(a|X_t)\,Q(X_t,a)\Big].$$
The first and fourth equalities come from different applications of iterated expectations, the second from the fact that $S_t$ is a constant conditional on $X_t, A_t$, and the third from the definition of $Q(X_t,A_t)$.
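The baseline step of the proof, $\mathbb{E}[S_t|X_t] = \sum_a \nabla_\theta \pi_\theta(a|X_t) = 0$, can be checked numerically with a softmax policy (an illustrative sketch; the parameters are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # For pi = softmax(theta): d/dtheta log pi(a) = one_hot(a) - softmax(theta).
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.3, -1.2, 0.7])
pi = softmax(theta)

# E[S_t | X_t] = sum_a pi(a|x) * grad log pi(a|x) = sum_a grad pi(a|x) = 0.
expected_score = sum(pi[a] * grad_log_pi(theta, a) for a in range(3))
```

The result is the zero vector componentwise, which is exactly why subtracting any baseline $V(X_t)$ leaves the policy gradient unbiased.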

D.2 PROOF OF FC-PG THEOREM

Proof of Theorem 1. We need to show that $\mathbb{E}\big[S_t \frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,V(X_t,\Phi_t)\big] = 0$, so that $\frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,V(X_t,\Phi_t)$ is a valid baseline. As previously, we proceed with the law of iterated expectations, conditioning successively on $X_t$ then $\Phi_t$:
$$\mathbb{E}\Big[S_t \frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,V(X_t,\Phi_t)\Big] = \mathbb{E}\Big[\mathbb{E}\Big[S_t \frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,V(X_t,\Phi_t)\,\Big|\,X_t,\Phi_t\Big]\Big] = \mathbb{E}\Big[V(X_t,\Phi_t)\,\mathbb{E}\Big[S_t \frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,\Big|\,X_t,\Phi_t\Big]\Big].$$
Then we note that
$$\mathbb{E}\Big[S_t \frac{\pi_\theta(A_t|X_t)}{P^\pi(A_t|X_t,\Phi_t)}\,\Big|\,X_t,\Phi_t\Big] = \sum_a P^\pi(a|X_t,\Phi_t)\,\nabla_\theta \log \pi_\theta(a|X_t)\,\frac{\pi_\theta(a|X_t)}{P^\pi(a|X_t,\Phi_t)} = \sum_a \nabla_\theta \pi_\theta(a|X_t) = 0.$$

Proof of Theorem 2. We start from the definition of the Q function:
$$Q(X_t,a) = \mathbb{E}[G_t|X_t,A_t=a] = \mathbb{E}_{\Phi_t}\big[\mathbb{E}[G_t|X_t,\Phi_t,A_t=a]\,\big|\,X_t,A_t=a\big] = \sum_\varphi P^\pi(\Phi_t=\varphi|X_t,A_t=a)\,Q(X_t,\Phi_t=\varphi,a).$$
We also have
$$P^\pi(\Phi_t=\varphi|X_t,A_t=a) = \frac{P^\pi(\Phi_t=\varphi|X_t)\,P^\pi(A_t=a|X_t,\Phi_t=\varphi)}{P^\pi(A_t=a|X_t)},$$
which, combined with the above, results in:
$$Q(X_t,a) = \sum_\varphi P^\pi(\Phi_t=\varphi|X_t)\,\frac{P^\pi(A_t=a|X_t,\Phi_t=\varphi)}{\pi_\theta(a|X_t)}\,Q(X_t,\Phi_t=\varphi,a) = \mathbb{E}\Big[\frac{P^\pi(A_t=a|X_t,\Phi_t)}{\pi_\theta(a|X_t)}\,Q(X_t,\Phi_t,a)\,\Big|\,X_t\Big].$$
For the compatibility with policy gradients, we start from $\mathbb{E}[S_t G_t] = \mathbb{E}\big[\sum_a \nabla_\theta \pi_\theta(a|X_t)\,Q(X_t,a)\big]$. We replace $Q(X_t,a)$ by the expression above and obtain
$$\mathbb{E}[S_t G_t] = \mathbb{E}\Big[\sum_a \nabla_\theta \pi_\theta(a|X_t)\,\mathbb{E}\Big[\frac{P^\pi(A_t=a|X_t,\Phi_t)}{\pi_\theta(a|X_t)}\,Q(X_t,\Phi_t,a)\,\Big|\,X_t\Big]\Big] = \mathbb{E}\Big[\sum_a \nabla_\theta \log \pi_\theta(a|X_t)\,P^\pi(A_t=a|X_t,\Phi_t)\,Q(X_t,\Phi_t,a)\Big].$$
Note that in the case of a large number of actions, the above can be estimated by
$$\nabla_\theta \log \pi_\theta(A'_t|X_t)\,\frac{P^\pi(A'_t|X_t,\Phi_t)}{\pi_\theta(A'_t|X_t)}\,Q(X_t,\Phi_t,A'_t),$$
where $A'_t$ is an independent sample from $\pi_\theta(\cdot|X_t)$; note in particular that $A'_t$ must NOT be the action $A_t$ that gave rise to $\Phi_t$, which would result in a biased estimator.
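The identity in Theorem 1 can be verified exactly on a toy discrete joint distribution over $(A_t, \Phi_t)$ given a fixed state (the distributions and the baseline $V$ are arbitrary illustrative choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -0.5, 1.0])
pi = softmax(theta)                       # pi(a | x), 3 actions

# Arbitrary conditional P(phi | x, a) over 2 values of phi (rows = actions).
cond_phi = np.array([[0.7, 0.3],
                     [0.2, 0.8],
                     [0.5, 0.5]])
joint = pi[:, None] * cond_phi            # P(a, phi | x)
post = joint / joint.sum(axis=0, keepdims=True)   # P(a | x, phi)

V = np.array([3.0, -7.0])                 # arbitrary baseline V(x, phi)

def grad_log_pi(a):
    g = -softmax(theta)
    g[a] += 1.0
    return g

# E[ S_t * pi(A|X) / P(A|X,Phi) * V(X,Phi) ], computed exactly by enumeration.
term = sum(joint[a, f] * grad_log_pi(a) * (pi[a] / post[a, f]) * V[f]
           for a in range(3) for f in range(2))
```

The importance ratio collapses $P(a,\varphi)/P(a|\varphi)$ to $P(\varphi)$, leaving the zero-mean score inside the sum, so `term` is the zero vector for any choice of $V$.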

D.3 PROOF OF CCA-PG THEOREMS

Assume that $\Phi_t$ and $A_t$ are conditionally independent given $X_t$. Then $\frac{\pi_\theta(a|X_t)}{P^\pi(a|X_t,\Phi_t)} = 1$ for all a; in particular, this holds when evaluating at the random value $A_t$. From this simple observation, both CCA-PG theorems follow from the FC-PG theorems. To prove the lower variance of the hindsight advantage, note that
$$\mathbb{V}[G_t - V(X_t,\Phi_t)] = \mathbb{E}[(G_t - V(X_t,\Phi_t))^2] = \mathbb{E}[G_t^2] - \mathbb{E}[V(X_t,\Phi_t)^2],$$
$$\mathbb{V}[G_t - V(X_t)] = \mathbb{E}[(G_t - V(X_t))^2] = \mathbb{E}[G_t^2] - \mathbb{E}[V(X_t)^2],$$
where the second equality in each line comes from the fact that $\mathbb{E}[G_t V(X_t,\Phi_t)|X_t,\Phi_t] = V(X_t,\Phi_t)^2$. To prove the first statement, we expand $(G_t - V(X_t,\Phi_t))^2 = G_t^2 + V(X_t,\Phi_t)^2 - 2 G_t V(X_t,\Phi_t)$, and apply the law of iterated expectations to the last term:
$$\mathbb{E}[G_t V(X_t,\Phi_t)] = \mathbb{E}[\mathbb{E}[G_t V(X_t,\Phi_t)|X_t,\Phi_t]] = \mathbb{E}[V(X_t,\Phi_t)\,\mathbb{E}[G_t|X_t,\Phi_t]] = \mathbb{E}[V(X_t,\Phi_t)^2].$$
The proof for the second statement is identical. Finally, we note that by Jensen's inequality (since $V(X_t) = \mathbb{E}[V(X_t,\Phi_t)|X_t]$), we have $\mathbb{E}[V(X_t,\Phi_t)^2] \ge \mathbb{E}[V(X_t)^2]$, from which we conclude that $\mathbb{V}[G_t - V(X_t,\Phi_t)] \le \mathbb{V}[G_t - V(X_t)]$.
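The variance inequality can be illustrated with a quick Monte-Carlo experiment in the spirit of the High-Variance-Key-To-Door setup (the return model is a made-up toy, not the actual environment):

```python
import numpy as np

# Monte-Carlo illustration: conditioning the baseline on a hindsight feature
# Phi (here, the exogenous "apple value") gives Var[G - V(X,Phi)] <= Var[G - V(X)].
rng = np.random.default_rng(1)
n = 100_000

phi = rng.choice([1.0, 10.0], size=n)    # exogenous luck, revealed in hindsight
noise = rng.normal(0.0, 0.1, size=n)     # small residual randomness
G = 10.0 * phi + noise                   # returns near 10 or 100

V_x = G.mean()                           # state baseline, roughly 55
V_x_phi = 10.0 * phi                     # hindsight baseline, 10 or 100

var_plain = np.var(G - V_x)              # dominated by the +/-45 luck term
var_hindsight = np.var(G - V_x_phi)      # only the residual noise remains
```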

D.4 PROOFS OF MODEL-BASED GRADIENT THEOREMS IN APPENDIX E

Proof of Lemma 1. The proof follows from two simple facts. The first is that the return is a deterministic function $G(X_t, a, E_{t^+})$. The second is that, by the law of iterated expectations, we have $\mathbb{E}_{E_{t^+}}[G(X_t,a',\varepsilon_{t^+})] = \mathbb{E}_{X_{t^+}}\big[\mathbb{E}_{E_{t^+}|X_{t^+}}[G(X_t,a',\varepsilon_{t^+})]\big]$ for any distribution of $X_{t^+}$. The left-hand side is $\mathbb{E}_{X_{t^+}\sim p(\cdot|X_t,a')}[G]$. Taking the distribution of $X_{t^+}$ to be $p(\cdot|X_t,a)$, we obtain the desired result.

Proof of Lemma 2. The policy gradient can be written:
$$\sum_{A_t, E_{t^+}} P(E_{t^+})\,\pi_\theta(A_t|X_t)\,\nabla_\theta \log \pi_\theta(A_t|X_t)\,G(X_t,A_t,E_{t^+}).$$
But we also have:
$$P(E_{t^+}) = P(E_{t^+}|X_t) = \sum_{X_{t^+}, A_t} \pi_\theta(A_t|X_t)\,P(X_{t^+}|A_t,X_t)\,P(E_{t^+}|X_{t^+},A_t).$$
For simplicity, denote $\kappa = \pi_\theta(A_t|X_t)\,P(X_{t^+}|A_t,X_t)\,P(E_{t^+}|X_{t^+},A_t)$. Combined with equation (5), we find:
$$\sum_{X_{t^+}, A_t, E_{t^+}, A'_t} \kappa\,\pi_\theta(A'_t|X_t)\,\nabla_\theta \log \pi_\theta(A'_t|X_t)\,G(X_t,A'_t,E_{t^+}). \quad (6)$$
Next, we analyze the same quantity but replacing $G(X_t,A'_t,E_{t^+})$ by $G(X_t,A_t,E_{t^+})$, and find:
$$\sum_{X_{t^+}, A_t, E_{t^+}, A'_t} \kappa\,\pi_\theta(A'_t|X_t)\,\nabla_\theta \log \pi_\theta(A'_t|X_t)\,G(X_t,A_t,E_{t^+}) = \sum_{X_{t^+}, A_t, E_{t^+}} \kappa\,G(X_t,A_t,E_{t^+}) \sum_{A'_t} \pi_\theta(A'_t|X_t)\,\nabla_\theta \log \pi_\theta(A'_t|X_t) = 0, \quad (7)$$
since $\sum_{A'_t} \pi_\theta(A'_t|X_t)\,\nabla_\theta \log \pi_\theta(A'_t|X_t) = 0$. Subtracting equation (7) from (6), we obtain the desired result.

E CREDIT ASSIGNMENT AND STRUCTURAL CAUSAL MODELS

In this section, we provide an alternative view and intuition behind the CCA-PG algorithm by investigating credit assignment through the lens of causality theory, in particular structural causal models (SCMs) (Pearl, 2009a) . We relate these ideas to the use of common random numbers (CRN), a standard technique in optimization with simulators (Glasserman & Yao, 1992) . We start by presenting algorithms with full knowledge of the environment in the form of both a perfect model and access to the random number generator (RNG) and see how an SCM of the environment can improve credit assignment. We progressively relax assumptions until no knowledge of the environment or its random number generator is required and CCA-PG is recovered.

E.1 STRUCTURAL CAUSAL MODEL OF THE MDP

Structural causal models (SCM) (Pearl, 2009a) are, informally, models where all randomness is exogenous, and where all variables of interest are modeled as deterministic functions of other variables and of the exogenous randomness. They are of particular interest in causal inference as they enable reasoning about interventions, i.e. how the distribution of a variable would change under external influence (such as forcing a variable to take a given value, or changing the process that defines a variable), and about counterfactual interventions, i.e. how a particular observed outcome (sample) of a variable would have changed under external influence. Formally, an SCM is a collection of model variables $\{V \in \mathcal{V}\}$, exogenous random variables $\{E \in \mathcal{E}\}$, and distributions $\{p_E(\varepsilon), E \in \mathcal{E}\}$, one per exogenous variable, where the exogenous random variables are all assumed to be independent. Each variable V is defined by a function $V = f_V(\mathrm{pa}(V), E)$, where $\mathrm{pa}(V)$ is a subset of $\mathcal{V}$ called the parents of V. The model can be represented by a directed graph in which every node has an incoming edge from each of its parents. For the SCM to be valid, the induced graph has to be a directed acyclic graph (DAG), i.e. there exists a topological ordering of the variables such that for any variable $V_i$, $\mathrm{pa}(V_i) \subset \{V_1, \ldots, V_{i-1}\}$; in the following we will assume such an ordering. This provides a simple sampling mechanism for the model, where the exogenous random variables are first sampled according to their distributions, and each node is then computed in indexing order. Note that any probabilistic model can be represented as an SCM by virtue of reparametrization (Kingma & Ba, 2014; Buesing et al., 2019). However, such a representation is not unique, i.e. different SCMs can induce the same distribution. We now parameterize the MDP given in section 2.1 as an SCM.
The transition from $X_t$ to $X_{t+1}$ under $A_t$ is given by the transition function $f_X$: $X_{t+1} = f_X(X_t, A_t, E^X_t)$, with exogenous variable (random number) $E^X_t$. The policy function $f_\pi$ maps a random number $E^\pi_t$, policy parameters θ, and the current state $X_t$ to the action $A_t = f_\pi(X_t, E^\pi_t, \theta)$. Together, $f_\pi$ and $E^\pi_t$ induce the policy, a distribution $\pi_\theta(A_t|X_t)$ over actions. Without loss of generality, we assume that the reward is a deterministic function of the state and action: $R_t = f_R(X_t, A_t)$. $E^X$ and $E^\pi$ are random variables with a fixed distribution; all changes to the policy are absorbed by changes to the deterministic function $f_\pi$. Denoting $E_t = (E^X_t, E^\pi_t)$, note that the next state and reward $(X_{t+1}, R_t)$ are deterministic functions of $X_t$ and $E_t$, since we have $X_{t+1} = f_X(X_t, f_\pi(X_t, E^\pi_t, \theta), E^X_t)$ and similarly $R_t = f_R(X_t, f_\pi(X_t, E^\pi_t, \theta))$. Let $X_{t^+} = (X_{t'})_{t'>t}$ and similarly $E_{t^+} = (E^X_{t'}, E^\pi_{t'})_{t'>t}$. Through the composition of the functions $f_X$, $f_\pi$ and $f_R$, the return $G_t$ (under policy $\pi_\theta$) is a deterministic function (denoted G for simplicity) of $X_t$, $A_t$ and $E_{t^+}$.
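The SCM parameterization above can be sketched concretely. The dynamics below ($f_X$, the linear policy logits) are illustrative assumptions; the key structural point is that, once the exogenous noise is drawn, every step is a deterministic function, with the policy reparametrized via the standard Gumbel-max construction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f_pi(x, eps_pi, theta):
    # Gumbel-max reparameterization: with eps_pi ~ Gumbel(0,1) i.i.d.,
    # argmax_a (logits[a] + eps_pi[a]) is distributed as softmax(logits),
    # so A_t = f_pi(X_t, E^pi_t, theta) induces the policy pi_theta(.|x).
    logits = theta @ x
    return int(np.argmax(logits + eps_pi))

def f_X(x, a, eps_x):
    # Deterministic transition given the exogenous noise E^X_t (toy dynamics).
    return 0.9 * x + a + eps_x

rng = np.random.default_rng(2)
theta = rng.normal(size=(3, 4))   # 3 actions, 4-dimensional state
x = rng.normal(size=4)

eps_pi = rng.gumbel(size=3)       # E^pi_t
eps_x = rng.normal(size=4)        # E^X_t

a = f_pi(x, eps_pi, theta)        # A_t     = f_pi(X_t, E^pi_t, theta)
x_next = f_X(x, a, eps_x)         # X_{t+1} = f_X(X_t, A_t, E^X_t)
```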

E.2 MODEL-KNOWN POLICY GRADIENT

In this section, we assume perfect knowledge of the transition functions, reward functions, and SCM distributions. We use the term 'model-known' rather than 'model-based' to describe this situation. Consider a time t, state $X_t$, and a possible action a for $A_t$. The return $G_t$ is given by the deterministic function $G(X_t, a, E_{t^+})$, and the Q function $Q(X_t,a) = \mathbb{E}_{E_{t^+}}[G_t|X_t,A_t=a]$ is its expectation over the exogenous variables $E_{t^+}$. We are generally interested in evaluating the Q-function difference $Q(X_t,a) - Q(X_t,a')$ for two actions a and a'. Note in particular that the advantage can be written $A(X_t,a) = \mathbb{E}_{a'\sim\pi_\theta}[Q(X_t,a) - Q(X_t,a')]$. The Q-function difference can be estimated by a difference $G(X_t,a,E_{t^+}) - G(X_t,a',E'_{t^+})$, where $E_{t^+}$ and $E'_{t^+}$ are two independent samples. If we have direct access to $E_{t^+}$, for instance because we have access to a simulator and to its random number generator, we can use common random numbers to potentially reduce variance: $G(X_t,a,E_{t^+}) - G(X_t,a',E_{t^+})$. Note that if actions were continuous, G differentiable, and $a' = a + \delta$ with δ small, the quantity becomes $\frac{\partial G}{\partial a}(X_t,a,E_{t^+}) \times \delta$, i.e. the gradient of the return G with respect to the action a (see Silver et al. 2014; Heess et al. 2015; Buesing et al. 2016). In general, we will be interested in cases where the return may not be differentiable. However, the example highlights that the use of gradient methods implicitly assumes the use of common random numbers, and that return differences computed with common random numbers can be seen as a numerical approximation to the gradient of the return. Having access to the model, suppose we make a two-sample estimate of the policy gradient using common random numbers, and use the return of one action as a baseline for the other.
The policy gradient estimate is
$$\nabla_\theta V(x_0) = \mathbb{E}_{A_t, A'_t \sim \pi_\theta,\, E_{t^+}}\big[S_t\,(G(X_t,A_t,E_{t^+}) - G(X_t,A'_t,E_{t^+}))\big], \quad (8)$$
where we recall that $S_t = \nabla_\theta \log \pi_\theta(A_t|X_t)$. In many situations this estimate will have lower variance than one obtained with a state-conditional baseline (cf. eq. (1)), since the use of common noise for G will strongly correlate return and baseline (which differ only in a single argument of the function G). Since $A_t$ and $A'_t$ are samples from the same distribution, the update above remains valid if we swap $A_t$ and $A'_t$; averaging the two updates, we obtain a two-point policy gradient:
$$\nabla_\theta V(x_0) = \tfrac{1}{2}\,\mathbb{E}_{A_t, A'_t, E_{t^+}}\big[Y_t\,(G(X_t,A_t,E_{t^+}) - G(X_t,A'_t,E_{t^+}))\big],$$
where $Y_t$ denotes the score-function difference $\nabla_\theta \log \pi_\theta(A_t|X_t) - \nabla_\theta \log \pi_\theta(A'_t|X_t)$. The use of a model is required since we need returns from the same state with two different actions (note that in the case of a POMDP, the same initial state would require the same history, which is often computationally excessive). More generally, we could use K i.i.d. samples $A^{(1)}_t, \ldots, A^{(K)}_t$ and use the leave-one-out average empirical return as a baseline for each sample, which yields
$$\nabla_\theta V(x_0) = \tfrac{1}{K}\,\mathbb{E}_{A^{(1)}_t,\ldots,A^{(K)}_t,\, E_{t^+}}\Big[\sum_i \nabla_\theta \log \pi_\theta(A^{(i)}_t|X_t)\,\Delta_i\Big], \quad \text{where } \Delta_i \triangleq G(X_t, A^{(i)}_t, E_{t^+}) - \tfrac{1}{K-1}\sum_{j\ne i} G(X_t, A^{(j)}_t, E_{t^+}).$$
The idea of using multiple rollouts from the same initial state to perform more accurate credit assignment for policy gradient methods has been used under the name vine by Schulman et al. (2015). The authors also note the need for common random numbers to reduce the variance of the multiple-rollout estimate (see also Ng & Jordan 2013). Interestingly, if we replace $\Delta_i$ by the argmax or softmax of the Δ's, we obtain a gradient estimate similar to that of the cross-entropy method, a classical and very effective planning algorithm (De Boer et al., 2005; Langlois et al., 2019).
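The variance benefit of common random numbers is easy to demonstrate on a toy return function (the function below is made up: a small action effect swamped by exogenous noise):

```python
import numpy as np

# Illustration of common random numbers (CRN) for return differences:
# evaluating two actions under the SAME exogenous noise correlates the two
# returns, shrinking the variance of their difference.
rng = np.random.default_rng(3)
n = 100_000

def G(a, eps):
    # Toy return: action contributes a, exogenous noise contributes 10*eps.
    return a + 10.0 * eps

eps = rng.normal(size=n)
eps_indep = rng.normal(size=n)

diff_crn = G(1.0, eps) - G(0.0, eps)           # common random numbers
diff_indep = G(1.0, eps) - G(0.0, eps_indep)   # independent noise draws
```

With common noise the exogenous term cancels exactly, so `diff_crn` is the constant action effect (variance zero), while `diff_indep` has variance around 200.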

E.3 MODEL-BASED COUNTERFACTUAL POLICY GRADIENT

In the previous section, we derived low-variance policy updates under the assumption that we have access to both a perfect model and its noise generation process. We will now see how model-based counterfactual reasoning allows us to address both of these restrictions, recalling results from Buesing et al. (2019). First, we briefly recall what counterfactuals are, in particular in the context of reinforcement learning. Counterfactual queries intuitively correspond to questions of the form 'how would this precise outcome have changed, had I changed a past action to another one?'. In a structural model that consists of outcome variables X and action variables A set to a, estimating the counterfactual outcome X under an alternative action a' consists of the following three steps:
• Abduction: infer the exogenous noise variables E given the observation: E ∼ P(E|X).
• Intervention: fix the value of A to a'.
• Prediction: evaluate the outcome X conditional on the fixed values E and A = a'.
We begin with a lemma (following results from Buesing et al. (2019)), which states that, assuming model correctness, expectations of counterfactual estimates are equal to regular interventional expectations. Denote $p(\tau|X_t)$ the distribution of trajectories starting from $X_t$ and following $\pi_\theta$; $p(\tau|X_t,a)$ the distribution of trajectories starting from $X_t$, $A_t = a$, and following $\pi_\theta$ after $A_t$; and $p(\tau|X_t,a,E_{t^+})$ the distribution of trajectories starting at $X_t$, $A_t = a$, following the policy $\pi_\theta$, but forcing the value of all the SCM exogenous random variables to $E_{t^+}$ (note that this last distribution is in fact deterministic, since all randomness has been fixed).
Lemma 1. Under the assumptions above,
$$Q(X_t,a') = \mathbb{E}_{X_{t^+}\sim p(\cdot|X_t,a')}[G] = \mathbb{E}_{X_{t^+}\sim p(\cdot|X_t,a)}\Big[\mathbb{E}_{E_{t^+}|X_{t^+}}\big[\mathbb{E}_{X'_{t^+}\sim p(\cdot|X_t,a',\varepsilon_{t^+})}[G]\big]\Big].$$
In other words, we can use SCMs to perform off-policy or counterfactual evaluation without importance sampling, as long as we can infer the exogenous variables of interest. This lemma is particularly useful when using an imperfect model, which we now assume is the only model available. We denote the 'real-world' or data distribution by $p_D$ and the model distribution by $p_M$. A purely model-based counterfactual policy gradient is
$$\nabla_\theta V(x_0) = \sum_t \gamma^t\,\mathbb{E}_{A_t, A'_t \sim \pi_\theta,\, E_{t^+} \sim p_M}\big[S_t\,(G_M(X_t,A_t,E_{t^+}) - G_M(X_t,A'_t,E_{t^+}))\big]. \quad (12)$$
Using an imperfect model, this update could have high bias. Instead of fully trusting the synthetic data generated by the model, we can combine model data and real data in equation (11), by sampling the outer expectation with respect to $p_D$ and the inner ones with respect to $p_M$. We obtain the following counterfactual policy gradient estimate:
Lemma 2. Assuming no model bias, the policy gradient update is equal to
$$\nabla_\theta V(x_0) = \sum_t \gamma^t\,\mathbb{E}_{X_{t^+} \sim p_D(\cdot|X_t,\pi_\theta)}\,\mathbb{E}_{E_{t^+} \sim p_M(E_{t^+}|X_{t^+}),\, A'_t \sim \pi_\theta}\big[S'_t\,(G'_t - G_t)\big], \quad (13)$$
where $S'_t = \nabla_\theta \log \pi_\theta(A'_t|X_t)$ is the score function for the counterfactual action, and where $G'_t = G_M(X_t,A'_t,E_{t^+})$ is the model-based counterfactual return estimate. If we explicitly marginalize out $A'_t$, we obtain:
$$\mathbb{E}_{X_{t^+} \sim p_D(\cdot|X_t)}\,\mathbb{E}_{E_{t^+} \sim p_M(E_{t^+}|X_{t^+})}\Big[\sum_a \nabla_\theta \pi_\theta(a|X_t)\,(G_M(X_t,a,E_{t^+}) - G_t)\Big].$$
Note that, in contrast to eq. (13), and even when assuming a perfect model and posterior, the following update will generally be biased (we will later explain why):
$$\nabla_\theta V(x_0) = \sum_t \gamma^t\,\mathbb{E}_{X_{t^+} \sim p_D(\cdot|X_t)}\,\mathbb{E}_{E_{t^+} \sim p_M(E_{t^+}|X_{t^+}),\, A'_t \sim \pi_\theta}\big[S_t\,(G_t - G'_t)\big]. \quad (14)$$
In equations (13) and (14) above, $X_{t^+}$ is sampled from the real environment, and $E_{t^+}$ from the posterior over the noise given the observations (which would generally be given by Bayes' rule, following $P(E_{t^+}|X_{t^+}) \propto P(E_{t^+})\,P(X_{t^+}|E_{t^+})$).
In particular, this estimate does not require access to the random number generator; instead, it 'measures' (estimates) what the noise in the model must have been to give rise to the observations produced by the real environment. The term $G_t = G_D(X_t,A_t,E_{t^+})$ is the empirical real-world return, while $G'_t = G_M(X_t,A'_t,E_{t^+})$ is the counterfactual return that would have occurred, mutatis mutandis, for action $A'_t$ under the same noise realization. A very slight modification to equation (13) is to use the environment only to sample the trajectory $X_{t^+}$, but to use the model for both evaluations of the return:
$$\mathbb{E}_{X_{t^+} \sim p_D(\cdot|X_t,\pi_\theta)}\,\mathbb{E}_{E_{t^+} \sim p_M(E_{t^+}|X_{t^+})}\big[S'_t\,(G_M(X_t,A'_t,E_{t^+}) - G_M(X_t,A_t,E_{t^+}))\big]. \quad (16)$$
This may lead to less variance, and potentially even less bias: even though $G_M$ is a biased estimate of $G_D$, some of the bias will show up in both terms and cancel out, while it would remain in $G_D - G_M$. Note also that, in the presence of model bias, equations (13) and (16) would likely suffer from significantly fewer issues (bias and variance) than their purely model-based alternative (12), as the counterfactual updates are grounded in real data ($X_{t^+} \sim p_D$) and correspond to a 'reconstruction' instead of a prior sample.
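The abduction/intervention/prediction recipe can be shown end-to-end on a minimal SCM, where additive noise makes the abduction step exact (the mechanism and numbers are purely illustrative):

```python
# The three counterfactual steps on a minimal SCM: outcome X = A + E, with
# additive exogenous noise E, so abduction recovers E exactly.
def scm(a, e):
    return a + e              # mechanism: X = f(A, E)

# Observed data: taking action a_obs led to outcome x_obs.
a_obs, x_obs = 1.0, 3.5

# 1. Abduction: infer the exogenous noise consistent with the observation.
e_inferred = x_obs - a_obs    # E = X - A

# 2. Intervention: fix the action to an alternative a'.
a_alt = 4.0

# 3. Prediction: replay the mechanism under the inferred noise.
x_counterfactual = scm(a_alt, e_inferred)
```

Because the noise realization is held fixed, the counterfactual isolates the effect of the action change (here, `x_counterfactual` is 6.5, i.e. the observed 3.5 shifted by exactly the action difference of 3.0).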

E.4 FUTURE CONDITIONAL VALUE FUNCTIONS

In the previous subsection, we assumed knowledge of a (potentially imperfect) model but no access to the random number generation; in this subsection, we make the inverse assumption: we assume we have access to the random number generation, and develop a model-free method that can leverage this access without explicitly assuming a model. Let us consider again the vanilla, single-action policy gradient estimate: $\nabla_\theta V(x_0) = \mathbb{E}[S_t(G_t - V(X_t))]$. Classically, the baseline is assumed to be a function of $X_t$ (recall that in the POMDP case, $X_t$ includes the history of observations). If V is a function of any quantity which is statistically dependent on $A_t$ conditionally on $X_t$, the baseline could result in a biased estimator for the policy gradient. A sufficient, standard assumption to guarantee this condition is to not use any data from the future relative to time step t as input to V, although such knowledge is in principle available in off-line policy updates. While the optimal baseline is not necessarily a state value function, a good surrogate for determining a baseline is to minimize the variance of the advantage: for a state-dependent baseline, this corresponds to setting the baseline to the value function. Note that in a structural causal model, the random variables E are explicitly assumed to have a distribution affected by no other random variables; in particular, $E_{t^+} \perp\!\!\!\perp A_t \mid X_t$. It is therefore valid to include them in the baseline; by the same argument as above, a strong candidate baseline is $V(X_t, E_{t^+}) = \mathbb{E}[G_t|X_t,E_{t^+}]$. What does this baseline correspond to? Note that in this expectation the only randomness left is in the action $A_t$; the corresponding generalized value function is in fact $V(X_t, E_{t^+}) = \sum_a \pi_\theta(a|X_t)\,G(X_t,a,E_{t^+})$.
Learning this value function is therefore very closely related to learning the return function G, which is itself closely related to learning the composition of the transition and reward functions. The corresponding policy gradient becomes

E_{E_{t+}, A_t} [S_t (G_t − V(X_t, E_{t+}))]

where V(X_t, E_{t+}) can be learned by minimizing the squared loss between a function of (X_t, E_{t+}) and empirical returns G_t. Note that this estimate of the advantage also has lower variance than G_t − V(X_t), which follows from V(X_t) = E[V(X_t, E_{t+})] and Jensen's inequality (see Weber et al. (2019) for a proof, generalized definitions of value functions, and conditions for valid baselines; see Weaver & Tao (2001) and Greensmith et al. (2004) for results on optimal baselines for policy gradients).
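The variance reduction afforded by conditioning the baseline on the exogenous noise can be illustrated on a toy bandit (the reward function, noise scale, and policy below are assumptions for illustration, not the paper's experiments): the return is G = r(A) + E with exogenous noise E, and the future-conditional baseline V(X, E) = Σ_a π(a) G(X, a, E) cancels the noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([0.0, 1.0])            # per-action mean reward (assumed)
pi = np.array([0.5, 0.5])           # current policy

n = 100_000
a = rng.choice(2, size=n, p=pi)     # sampled actions A_t
e = rng.normal(0.0, 3.0, size=n)    # high-variance exogenous noise E_{t+}
g = r[a] + e                        # empirical returns G_t

v_x = pi @ r                        # plain baseline V(X) = E[G]
v_xe = pi @ r + e                   # future-conditional V(X, E) = sum_a pi(a) G(X, a, E)

adv_plain = g - v_x                 # variance ~ Var[r(A)] + Var[E] ~ 9.25
adv_fc = g - v_xe                   # = r[a] - pi @ r: the noise cancels exactly

# Both advantages have mean zero (both baselines are valid), but Jensen's
# inequality guarantees Var[G - V(X, E)] <= Var[G - V(X)].
assert adv_fc.var() < adv_plain.var()
```

The gap between the two variances grows with the noise scale, which is exactly the "luck" component that the future-conditional baseline removes.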

E.5 RECOVERING MODEL-FREE COUNTERFACTUAL POLICY GRADIENTS (CCA-PG)

In the last two sections, we relaxed the assumptions of having access to either a model of the environment or to the random number generator. In this section, we combine both ideas to recover our proposed algorithm, CCA-PG. To do so, we follow the idea from section E.4 that a future-conditional value function can be model-like and result in improved credit assignment; however, as in section E.3, instead of assuming knowledge of E_{t+}, we estimate it from trajectory information. Let F_t represent any subset or function of the trajectory, such as the sequence of future states X_{t+}, the return, the sequence of observations, actions, etc. In MDPs, F_t only needs to be a function of present and future states; in POMDPs, F_t will need to be a function of the entire trajectory, for instance of present and future agent states.

A first approach, related to distributional reinforcement learning (Veness et al., 2015; Bellemare et al., 2017), is to forgo modeling the environment (as in section E.3) and directly model distributions over returns or value functions. We can induce such a probabilistic model over returns by assuming a given parametrized base distribution p_θ(E), an approximate posterior q(E | F), and value functions V(X_t, E_{t+}) and Q(X_t, A_t, E_{t+}). These components can be learned by the KL-regularized regression

Σ_t E_{E_t ∼ q(E_t | F_t)} [ log (q(E_t | F_t) / p(E_t)) + (V(X_t, E_t) − G_t)² + (Q(X_t, A_t, E_t) − G_t)² ].

This objective intuitively captures the idea of inferring an E_t from F_t such that E_t is approximately independent of the trajectory (the KL term) yet good at predicting the return (the value terms). We can then train a policy with a counterfactual policy gradient as follows: for each time t, sample E_t from q(E_t | F_t), compute V and Q, and apply either update (3) or (4).
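A minimal sketch of this KL-regularized objective, assuming a Gaussian prior and posterior and a toy linear value function (the Q term is omitted for brevity; all parametrizations here are our own illustrative choices):

```python
import numpy as np

def kl_gauss(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ), in closed form.
    return 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)

def hindsight_loss(mu, sigma, v_fn, x, g, rng, n_samples=256):
    # Monte Carlo estimate of  KL + E_{E ~ q}[ (V(x, E) - g)^2 ],
    # where (mu, sigma) would be read off hindsight features F_t.
    e = rng.normal(mu, sigma, size=n_samples)
    value_loss = np.mean((v_fn(x, e) - g) ** 2)
    return kl_gauss(mu, sigma) + value_loss

rng = np.random.default_rng(0)
v_fn = lambda x, e: x + e            # toy value function V(X, E)

# A posterior that "measures" E well (mu near the residual G - X) pays a small
# value loss but a KL cost; collapsing sigma -> 0 makes the KL blow up, so the
# optimum trades return prediction against independence from the trajectory.
loss_informative = hindsight_loss(mu=1.0, sigma=0.5, v_fn=v_fn, x=0.0, g=1.0, rng=rng)
loss_prior = hindsight_loss(mu=0.0, sigma=1.0, v_fn=v_fn, x=0.0, g=1.0, rng=rng)
```

Here the informative posterior achieves a lower total loss than staying at the prior, illustrating why the regression is pulled toward hindsight features that predict the return.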
This approach is flawed, however: even if E ⊥⊥ A_t | X_t is assumed to hold under the prior, it will not in general hold under the posterior q, i.e. the extent to which the agent knows the true value of E_t will in general depend on A_t. For instance, consider a POMDP corresponding to a maze navigation task, where the only uncertainty is the maze layout. Including the maze layout in the value function would not bias the policy gradient update, and would typically lower its variance. However, if we trained (in a supervised fashion) a network to produce an estimate of the map given the agent's observations, and provided the value function with this hindsight estimate of the map, the resulting policy update would in general be biased. This is the same reason why equation (15) is in general biased, even assuming a perfect model and posterior. For this reason, we forgo explicit probabilistic modeling and choose an implicit approach, modeling a function Φ_t of the trajectory that captures information useful for predicting the return, and therefore only implicitly performs inference over E_t. Following the intuition developed in this section, we require that Φ_t be independent of A_t while predicting returns accurately, which finally connects back to the algorithms detailed in section 2.
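The bias described here can be made concrete in a tiny two-action bandit sketch (a hypothetical setup of our own): if the hindsight feature Φ_t depends on the action, the "baseline" V(X, Φ) can absorb the action's effect and bias the policy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                # logit of pi(a=1)
n = 200_000

p1 = 1.0 / (1.0 + np.exp(-theta))          # pi(a=1) = 0.5
a = rng.random(n) < p1                     # sampled actions
g = a.astype(float)                        # return G = A: pure skill, no luck
score = a - p1                             # S_t = d/dtheta log pi(A_t)

# Valid: Phi independent of A (here trivial), baseline V(X) = E[G].
grad_valid = np.mean(score * (g - g.mean()))   # ~ true gradient 0.25

# Invalid: Phi = G depends on A, so the hindsight baseline V(X, Phi)
# = E[G | Phi] = G, and the estimated gradient collapses to zero.
grad_biased = np.mean(score * (g - g))
```

The true gradient d/dθ E[G] is p1(1 − p1) = 0.25; the action-dependent hindsight feature drives the estimate to exactly zero, which is why CCA-PG enforces the independence constraint on Φ_t rather than using an unconstrained posterior.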

F LINKS TO CAUSALITY AND SIMPLE EXAMPLES

In this section, we further link the ideas developed in this report to causality theory. In particular, we connect them to two notions from causality theory known as the individual treatment effect (ITE) and the average treatment effect (ATE). In the previous section, we extensively leveraged the framework of structural causal models. It is however known that distinct SCMs may correspond to the same distribution; when learning a model from data, we may learn a model with the correct distribution but with an incorrect structural parametrization and incorrect counterfactuals. We may therefore wonder whether counterfactual-based approaches are flawed when using such a model. We investigate this question, and analyze our algorithm in very simple settings for which closed-form computations can be worked out.

F.1 INDIVIDUAL AND AVERAGE TREATMENT EFFECTS

Consider a simple medical example, which we model with an SCM as illustrated in figure 14. We assume a population of patients, each with a full medical state denoted S, which summarizes all factors, known or unknown, that affect a patient's future health, such as genotype, phenotype, etc. While S is never known perfectly, some of the patient's medical history H may be known, including current symptoms. On the basis of H, a treatment decision T is taken; as is often done, for simplicity we take T to be a binary variable with values in {1='treatment', 0='no treatment'}. Finally, health state S and treatment T result in an observed medical outcome O, a binary variable with values in {1='cured', 0='not cured'}. For given values S = s and T = t, the outcome is a function (also denoted O for simplicity) O(s, t). Additional medical information F may be observed, e.g. further symptoms or information obtained after the treatment, from tests such as X-rays, blood tests, or autopsy.

Definition 1 (Individual Treatment Effect). ITE[s] = O(s, T = 1) − O(s, T = 0). (18)

Definition 2 (Average Treatment Effect). ATE[h] = E_{S ∼ P(S | H = h)} [O(S, T = 1) − O(S, T = 0)]. (19)

Since the exogenous noise (here, S) is generally not known, the ITE is typically an unknowable quantity.
For a particular patient (with hidden state S), we only observe the outcome under T = 0 or T = 1, depending on which treatment option was chosen; the counterfactual outcome will typically be unknown. Nevertheless, for a given SCM, it can be counterfactually estimated from the outcome and feedback, using the procedure detailed in section E.3 (we suppose O is included in F to simplify notation).

Definition 3 (Counterfactually Estimated Individual Treatment Effect).

CF-ITE[H = h, F = f, T = 1] = δ(o = 1) − Σ_{s'} P(S = s' | H = h, F = f, T = 1) O(s', T = 0)    (20)

CF-ITE[H = h, F = f, T = 0] = Σ_{s'} P(S = s' | H = h, F = f, T = 0) O(s', T = 1) − δ(o = 1)    (21)

In general the counterfactually estimated ITE will not be exactly the ITE, since there may be remaining uncertainty about s. However, the following statements relate CF-ITE, ITE and ATE:
• If S is identifiable from O and F with probability one, then the counterfactually-estimated ITE is equal to the ITE.
• The average (over S, conditional on H) of the ITE is equal to the ATE.
• The average (over S and F, conditional on H) of the CF-ITE is equal to the ATE.
Assimilating O to a reward, the above illustrates that the ATE (equation 19) essentially corresponds to a difference of Q functions, the ITE (equation 18) to the return differences found in equations (8) and (17), and the counterfactual ITE to the quantities found in equations (13) and (14). In contrast, the advantage G_t − V(H_t) is a difference between a return (a sample-level quantity) and a value function (a population-level quantity, which averages over all individuals with the same medical history H); this discrepancy explains why the return-based advantage estimate can have very high variance.

In both models of section F.3, the true value of giving the drug is 2/3, and of not giving it 1/3, which leads to an ATE of 1/3. For each model, we evaluate the variance of the CF-ITE under each of the four possible treatment-outcome pairs. The results are summarized in table 5. Under model I, the variance of the CF-ITE estimate (which is the variance of the advantage used in the CCA-PG gradient) is 1/6, while it is 1 under model II, which implies model I is a better model for leveraging counterfactuals in policy decisions.



Previous actions and rewards are provided as part of the observation, as is generally beneficial in partially observable Markov decision processes. Note more generally that any function of X_t and Φ_t can in fact be used as a valid baseline.




Figure 1: Left: Comparison of CCA-PG and PG in contextual bandits with feedback, for various levels of reward noise σr. Results are averaged over 6 independent runs. Right: Performance of CCA-PG on the bandit task, for different values of λIM. Not properly enforcing the independence constraint results in strong degradation of performance.

Figure 2: Visuals of the Key-To-Door environments. The agent is represented by the beige pixel, the key by brown, apples by green, and the final door by blue. The agent has a partial field of view, highlighted in white.

Figure 3: Probability of opening the door and total reward obtained on the High-Variance-Key-To-Door task (left two) and the Low-Variance-Key-To-Door task (right two).

Figure 4: Multi Task Interleaving description. Top left: delayed-feedback contextual bandit problem. Given a context, shown as a surrounding visual pattern, the agent has to pick up one of two colored squares, of which only one is rewarding. The agent is later teleported to a second room, where it is provided with the reward associated with its previous choice and a visual cue indicating which colored square it should have picked up. Top right: different tasks, each with a different color mapping, visual context and associated reward. Bottom: example of a generated episode, composed of randomly sampled tasks and color pairs.

Figure 5: Probability of solving 'easy' and 'hard' tasks, and total reward obtained, for the Multi Task Interleaving. Top plot: median over 10 seeds of the mean performance on 'easy' or 'hard' tasks.

Figure 6: Overall architecture of the RNN network. For simplicity we assume, without loss of generality, that B_t and Φ_t include X_t.

Figure 7: Multiagent versions of the bandit problems. CCA-PG agents outperform PG in the single timestep version.

Figure 9: Impact of variance on credit assignment performance. Probability of opening the door and total reward obtained as a function of the variance level induced by the apple reward discrepancy.

Figure 10: Visualization of attention weights on the High-Variance-Key-To-Door task. Left: a 2-dimensional heatmap showing how the hindsight function at each step attends to each step in the future. Red lines indicate the timesteps at which apples are picked up (marked as 'a'); green indicates the door (marked as 'd'); yellow indicates the key (marked as 'k'). Right: A bar plot of attention over future timesteps, computed at the step where the agent is just about to pick up the key.

Figure 11: Probability of solving each task in the 6-task setup for Multi Task Interleaving.

Figure 12: Impact of the number of backpropagation-through-time steps performed into the hindsight function for the CCA RNN. Probability of solving the hard tasks in the 6-task setup of Multi Task Interleaving.

Figure 13: Impact of the unroll length on the probability of solving hard and easy tasks for the CCA RNN. Probability of picking up the correct squares for the hard and easy tasks in the 6-task setup of Multi Task Interleaving.

Let us denote the true environment distribution by p^D and the model distribution by p^M. Also, let G^D denote the true return function and G^M the imperfect model of it. Using the model, model-based variants of equation (8) are obtained by simply replacing p by p^M:

Figure 14: The medical treatment example as a structured causal model.


Table 5: CCA-PG variance estimates in the medical example. Red values are estimates for model I, blue ones for model II. CF-Prob denotes the posterior probabilities of the genetic state S given the treatment T and outcome O. CF-O is the counterfactual outcome. ITE is the individual treatment effect (difference between outcome and counterfactual outcome). CF-V is the counterfactual value function, computed as the average of CF-O under the posterior probabilities for S. CF-ITE is the counterfactual advantage estimate (difference between O and CF-V). Var is the variance of CF-ITE under the prior probabilities for the outcome.

Prior work also investigates counterfactuals in reinforcement learning, points out the issue of non-identifiability of the correct SCM, and suggests a sufficient condition for identifiability; we discuss this issue in appendix F. Closely related to our work is Hindsight Credit Assignment, a concurrent approach from Harutyunyan et al. (2019); in this paper, the authors also investigate value functions and critics that depend on future information. However, the information their estimators depend on is hand-crafted (future state or return) instead of an arbitrary function of the trajectory, and their estimators are not guaranteed to have lower variance. Our FC estimator generalizes theirs, and CCA further characterizes which statistics of the future provide a useful estimator. Relations between HCA, CCA and FC are discussed in appendix C. The HCA approach is further extended by Young (2019) and Zhang et al. (2019), who minimize a surrogate for the variance of the estimator, though that surrogate cannot be guaranteed to actually lower the variance. Similarly to state-HCA, it treats each reward separately instead of taking a trajectory-centric view as CCA does. Guez et al. (

Key-To-Door hyperparameters


As mentioned previously, for a given joint distribution over observations, rewards and actions, there may exist distinct SCMs that capture that distribution. Those SCMs all have the same ATE, which measures the effectiveness of a policy on average, but they will generally have different ITEs and counterfactual ITEs, which, when used in model-based counterfactual policy gradient estimators, lead to different estimators. Choosing the 'wrong' SCM will lead to the wrong counterfactual, and so we may wonder whether this is a cause for concern for our methods. We argue that, in terms of learning optimal behaviors (in expectation), estimating inaccurate counterfactuals is not a cause for concern. Since all estimators have the same expectation, they all lead to correct estimates of the effect of switching one policy for another, and therefore all lead to the optimal policy given the information available to the agent. In fact, one could go further and argue that, for the purpose of finding good policies in expectation, we should only care about the counterfactual for a particular patient inasmuch as it enables us to quickly and correctly take better actions for future patients for whom the information available to make the decision (H) is very similar. This encourages us to choose the SCM for which the CF-ITE has minimal variance, regardless of the value of the true counterfactual. In the next section, we elaborate on an example to highlight the difference in variance between different SCMs with the same distribution and optimal policy.

F.2 BETTING AGAINST A FAIR COIN

We begin with a simple example, borrowed from Pearl (2009b), showing that two SCMs that induce the same interventional and observational distributions can imply different counterfactual distributions. The example is a game of guessing the outcome of a fair coin toss. The action A and state S both take values in {h, t}. Under model I, the outcome O is 1 if A = S and 0 otherwise. Under model II, the guess is ignored, and the outcome is simply O = 1 if S = h. For both models, the average treatment effect E[O | A = h] − E[O | A = t] is 0, implying that in both models one cannot do better than random guessing. Under model I, the counterfactual for having observed outcome O = 1 and changing the action is always O = 0, and vice-versa (intuitively, changing the guess changes the outcome); the ITE is therefore ±1. Under model II, all counterfactual outcomes are equal to the observed outcomes, since the action has in fact no effect on the outcome; the ITE is always 0.

In the next section, we adapt the medical example into a problem in which the choice of action does affect the outcome. Using the CF-ITE as an estimator of the ATE, we study how the choice of SCM affects the variance of that estimator (and therefore how the choice of SCM should affect the speed at which we can learn the optimal treatment decision).
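The two coin-toss SCMs can be written out directly (a minimal sketch; the function and variable names are ours):

```python
# The coin-guessing game as two SCMs. Both models induce the same
# observational and interventional distributions but different counterfactuals.
STATES = ['h', 't']            # fair coin: each state has probability 1/2

def O_I(a, s):
    # Model I: outcome 1 iff the guess matches the coin.
    return int(a == s)

def O_II(a, s):
    # Model II: guess ignored; outcome 1 iff the coin is heads.
    return int(s == 'h')

def ate(O):
    # E[O | A = h] - E[O | A = t] under the fair coin.
    return sum(O('h', s) for s in STATES) / 2 - sum(O('t', s) for s in STATES) / 2

ate_I, ate_II = ate(O_I), ate(O_II)       # both 0: random guessing is optimal

# Counterfactual: having observed (A='h', S='h', O=1), switch the guess to 't'.
ite_I = O_I('h', 'h') - O_I('t', 'h')     # changing the guess flips the outcome
ite_II = O_II('h', 'h') - O_II('t', 'h')  # the action never mattered
```

The identical ATEs with differing ITEs are exactly the non-identifiability discussed above: no amount of observational or interventional data distinguishes the two models.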

F.3 MEDICAL EXAMPLE

Take the simplified medical example from figure 14, where a population of patients with the same symptoms comes to the doctor, and the doctor has a potential treatment T to administer. The state S represents the genetic profile of the patient, which can be one of three values {GENE_A, GENE_B, GENE_C}, each with probability 1/3. We assume that genetic testing is not available, so the value of S is unknown for each patient. The doctor has to decide whether or not to administer the drug to this population, based on repeated experiments; in other words, they have to find out whether the average treatment effect is positive. We consider the two following models:
• In model I, patients of type GENE_A always recover, patients of type GENE_C never do, and patients of type GENE_B recover if and only if they get the treatment; in particular, in this model, administering the drug never hurts.
• In model II, patients of type GENE_A and GENE_B recover when given the drug, but patients of type GENE_C do not; the situation is reversed (GENE_A and GENE_B patients do not recover, GENE_C patients do) when the drug is not given.
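The bookkeeping behind table 5 can be checked with a short sketch (the encodings and helper names are ours; hindsight feedback is taken to be F = O only). Exact rational arithmetic confirms that the CF-ITE recovers the ATE of 1/3 in expectation for every model and treatment, while its spread differs: the 1/6 and 1 figures quoted above are recovered here as the (uncentered) second moment of the CF-ITE.

```python
from fractions import Fraction as F

# O[model][gene] = (outcome | T=0, outcome | T=1); genes are uniform (1/3 each).
MODELS = {
    'I':  {'A': (1, 1), 'B': (0, 1), 'C': (0, 0)},
    'II': {'A': (0, 1), 'B': (0, 1), 'C': (1, 0)},
}
GENES = ['A', 'B', 'C']

def cf_ite(model, t, o):
    # Posterior over the gene given the observed treatment-outcome pair, then
    # the counterfactual value under the other treatment (eqs. 20-21 style).
    compat = [g for g in GENES if MODELS[model][g][t] == o]
    cf_v = sum(F(MODELS[model][g][1 - t]) for g in compat) / len(compat)
    return F(o) - cf_v if t == 1 else cf_v - F(o)

m2 = {}
for m in MODELS:
    m2[m] = F(0)
    for t in (0, 1):
        # Prior probability of each outcome under treatment t.
        probs = {o: F(sum(MODELS[m][g][t] == o for g in GENES), 3) for o in (0, 1)}
        mean = sum(probs[o] * cf_ite(m, t, o) for o in (0, 1))
        assert mean == F(1, 3)                 # expectation recovers the ATE
        # Accumulate the second moment, averaging the two treatments equally.
        m2[m] += F(1, 2) * sum(probs[o] * cf_ite(m, t, o) ** 2 for o in (0, 1))
```

Under model I the CF-ITE takes values in {0, 1/2}, while under model II it takes values in {−1, 1}; both average to 1/3, but the model II estimator is far noisier, matching the conclusion that model I is the preferable SCM for counterfactual policy updates.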

