PRACTICAL MARGINALIZED IMPORTANCE SAMPLING WITH THE SUCCESSOR REPRESENTATION

Abstract

Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.

1. INTRODUCTION

Off-policy evaluation (OPE) is a reinforcement learning (RL) task where the aim is to measure the performance of a target policy from data collected by a separate behavior policy (Sutton & Barto, 1998). As it can often be difficult or costly to obtain new data, OPE offers an avenue for re-using previously stored data, making it an important challenge for applying RL to real-world domains (Zhao et al., 2009; Mandel et al., 2014; Swaminathan et al., 2017; Gauci et al., 2018). Marginalized importance sampling (MIS) (Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019a) is a family of OPE methods which re-weight sampled rewards by directly learning the density ratio between the state-action occupancy of the target policy and the sampling distribution. This approach can have significantly lower variance than traditional importance sampling methods (Precup et al., 2001), which consider a product of ratios over trajectories, and is amenable to deterministic policies and behavior-agnostic settings where the sampling distribution is unknown. However, the body of MIS work is largely theoretical, and as a result, empirical evaluations of MIS have mostly been carried out on simple low-dimensional tasks, such as mountain car (state dim. of 2) or cartpole (state dim. of 4). In comparison, deep RL algorithms have shown successful behaviors in high-dimensional domains such as Humanoid locomotion (state dim. of 376) and Atari (image-based). In this paper, we present a straightforward approach for MIS that can be computed from the successor representation (SR) of the target policy. Our algorithm, the Successor Representation DIstribution Correction Estimation (SR-DICE), is the first method that allows MIS to scale to high-dimensional systems, far outperforming previous approaches.
In comparison to previous algorithms which rely on minimax optimization or kernel methods (Liu et al., 2018; Nachum et al., 2019a; Uehara & Jiang, 2019; Mousavi et al., 2020), SR-DICE requires only a simple convex loss applied to the linear function determining the reward, after computing the SR. Similar to the deep RL methods which can learn in high-dimensional domains, the SR can be computed easily using behavior-agnostic temporal-difference (TD) methods. This makes our algorithm highly amenable to deep learning architectures and applicable to complex tasks. Our derivation of SR-DICE also reveals an interesting connection between MIS methods and value function learning. The key motivation for MIS methods is that, unlike traditional importance sampling methods, they can avoid variance with an exponential dependence on horizon by re-weighting individual transitions rather than accumulating ratios along entire trajectories. We remark that while the MIS ratios only consider individual transitions, the optimization procedure is still subject to the dynamics of the underlying MDP. Subsequently, we use this insight to show a connection between a well-known MIS method, DualDICE (Nachum et al., 2019a), and Bellman residual minimization (Bellman, 1957; Baird, 1995), which can help explain some of the optimization properties and performance of DualDICE, as well as other related MIS methods. We benchmark the performance of SR-DICE on several high-dimensional domains in MuJoCo (Todorov et al., 2012) and Atari (Bellemare et al., 2013), against several recent MIS methods. Our results demonstrate two key findings regarding high-dimensional tasks. First, SR-DICE significantly outperforms the benchmark algorithms. We attribute this performance gap to SR-DICE's deep RL components, which outperform the MIS baselines in the same way that deep RL outperforms traditional methods on high-dimensional domains.
Unfortunately, part of this performance gap is due to the fact that the baseline MIS methods scale poorly to challenging tasks. In Atari we find that the baseline MIS methods exhibit unstable estimates, with errors often spanning many orders of magnitude. Second, MIS underperforms deep RL. Although SR-DICE achieves high performance, we find its errors are bounded by the quality of the SR. Consequently, SR-DICE and the standard SR achieve similar performance across all tasks. Worse still, we find that using a deep TD method comparable to DQN (Mnih et al., 2015) for policy evaluation outperforms both. Although the performance gap is minimal, for OPE there is no convincing argument for SR-DICE, or any current MIS method, as they introduce unnecessary complexity. However, this does not mean MIS is useless. We remark that the density ratios themselves are an independent objective which has been used for applications such as policy regularization (Nachum et al., 2019b; Touati et al., 2020), imitation learning (Kostrikov et al., 2019), off-policy policy gradients (Imani et al., 2018; Liu et al., 2019b; Zhang et al., 2019), and non-uniform sampling (Sinha et al., 2020). SR-DICE serves as a stable, scalable approach for computing these ratios. We provide extensive experimental details in the supplementary material and our code is made available.

2. BACKGROUND

Reinforcement Learning. RL is a framework for maximizing the accumulated reward of an agent interacting with its environment (Sutton & Barto, 1998). This problem is typically framed as a Markov Decision Process (MDP) (S, A, R, p, d_0, γ), with state space S, action space A, reward function R, dynamics model p, initial state distribution d_0 and discount factor γ. An agent selects actions according to a policy π : S × A → [0, 1]. In this paper we address the off-policy evaluation (OPE) problem, where the aim is to measure the normalized expected per-step reward of the policy R(π) = (1-γ) E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t)]. An important notion in OPE is the value function Q^π(s, a) = E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a], which measures the expected sum of discounted rewards when following π, starting from (s, a). We define d^π(s, a) as the discounted state-action occupancy, the probability of seeing (s, a) under policy π with discount γ: d^π(s, a) = (1-γ) Σ_{t=0}^∞ γ^t ∫_{s_0} d_0(s_0) p^π(s_0 → s, t) π(a|s) ds_0, where p^π(s_0 → s, t) is the probability of arriving at the state s after t time steps when starting from an initial state s_0. This distribution is important as R(π) equals the expected reward r(s, a) under d^π: R(π) = E_{(s,a)∼d^π, r}[r(s, a)].

Successor Representation. The successor representation (SR) (Dayan, 1993) of a policy is a measure of occupancy of future states. It can be viewed as a general value function that learns a vector of the expected discounted visitation for each state. The successor representation Ψ^π of a given policy π is defined as Ψ^π(s'|s) = E_π[Σ_{t=0}^∞ γ^t 1(s_t = s') | s_0 = s]. Importantly, the value function can be recovered from the SR by summing over the expected reward of each state: V^π(s) = Σ_{s'} Ψ^π(s'|s) E_{a'∼π}[r(s', a')].
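As a concrete illustration of these definitions, the tabular SR of a small MDP can be computed in closed form as Ψ^π = (I - γP^π)^{-1}, and the value function recovered as a matrix-vector product. The transition matrix and rewards below are made up for illustration:

```python
import numpy as np

# Minimal tabular sketch (hypothetical 3-state MDP): the SR is
# Psi^pi = (I - gamma * P^pi)^{-1}, and V^pi = Psi^pi @ r_pi,
# where r_pi is the expected per-state reward under pi.
gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],   # state-to-state transitions under pi
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([0.0, 1.0, 2.0])    # expected reward of each state under pi

Psi = np.linalg.inv(np.eye(3) - gamma * P_pi)   # successor representation
V_from_sr = Psi @ r_pi

# Reference: solve the Bellman equation (I - gamma * P^pi) V = r directly.
V_bellman = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
assert np.allclose(V_from_sr, V_bellman)
```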
For infinite state and action spaces, the SR can instead be generalized to the expected occupancy over features, known as the deep SR (Kulkarni et al., 2016) or successor features (Barreto et al., 2017). For a given encoding function φ : S × A → R^n, the deep SR ψ^π : S × A → R^n is defined as the expected discounted sum over features from the encoding function φ when starting from a given state-action pair and following π: ψ^π(s, a) = E_π[Σ_{t=0}^∞ γ^t φ(s_t, a_t) | s_0 = s, a_0 = a]. If the encoding φ(s, a) is learned such that the original reward function is a linear function of the encoding, r(s, a) = w^⊤φ(s, a), then similar to the original formulation of the SR, the value function can be recovered from a linear function of the deep SR: Q^π(s, a) = w^⊤ψ^π(s, a). The deep SR network ψ^π is trained to minimize the MSE between ψ^π(s, a) and φ(s, a) + γψ'(s', a'), where a' ∼ π(·|s'), on transitions (s, a, s') sampled from the data set. A frozen target network ψ' is used to provide stability (Mnih et al., 2015; Kulkarni et al., 2016), and is updated to the current network ψ' ← ψ^π after a fixed number of time steps. The encoding function φ is typically trained by an encoder-decoder network (Kulkarni et al., 2016; Machado et al., 2017; 2018a).
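The fixed-point structure of this training target can be sketched in a tabular setting, where one-hot features make the deep SR coincide with the tabular SR. The MDP below is hypothetical, and the expected update stands in for SGD on sampled transitions with a neural ψ^π:

```python
import numpy as np

# Sketch of the TD-style SR update with a frozen target network:
# psi <- phi + gamma * E_{s'}[psi_target(s')], with psi_target
# periodically refreshed to the current psi (hypothetical 3-state MDP).
gamma, n = 0.9, 3
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
phi = np.eye(n)                      # one-hot encoding of states
psi = np.zeros((n, n))               # current SR estimate
psi_target = psi.copy()              # frozen target network

for step in range(2000):
    psi = phi + gamma * P_pi @ psi_target   # expected TD fixed-point update
    if step % 5 == 0:                       # periodic target refresh
        psi_target = psi.copy()

# The iteration converges to the tabular SR (I - gamma * P^pi)^{-1}.
Psi_true = np.linalg.inv(np.eye(n) - gamma * P_pi)
assert np.allclose(psi, Psi_true)
```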

Marginalized Importance Sampling.

The goal of marginalized importance sampling methods is to learn the weights w(s, a) ≈ d^π(s,a)/d^D(s,a), using data contained in D. The main benefit of MIS is that unlike traditional importance sampling methods, the ratios are applied to individual transitions rather than complete trajectories, which can reduce the variance in long or infinite horizon problems. In other cases, the ratios themselves can be used for a variety of applications which require estimating the occupancy of state-action pairs.

DualDICE. Dual stationary DIstribution Correction Estimation (DualDICE) (Nachum et al., 2019a) is a well-known MIS method which uses a minimax optimization to learn the density ratios. The underlying objective which DualDICE aims to minimize is the following:

min_f J(f) := 1/2 E_{(s,a)∼d^D}[(f(s, a) - γE_{s',a'∼π}[f(s', a')])^2] - (1-γ)E_{s_0,a_0∼π}[f(s_0, a_0)].   (4)

It can be shown that Equation (4) is uniquely optimized by the MIS density ratio. However, since f(s, a) - γE_π[f(s', a')] is dependent on transitions (s, a, s'), there are two practical issues with this underlying objective. First, the objective contains a square within an expectation, giving rise to the double sampling problem (Baird, 1995), where the gradient will be biased when using only a single sample of (s, a, s'). Second, computing f(s, a) - γE_{s',π}[f(s', a')] for arbitrary state-action pairs, particularly those not contained in the data set, is non-trivial, as it relies on an expectation over succeeding states, which is generally inaccessible without a model of the environment. To address both concerns, DualDICE uses Fenchel duality (Rockafellar, 1970) to create the following minimax optimization problem:

min_f max_w J(f, w) := E_{(s,a,s')∼d^D, a'∼π}[w(s, a)(f(s, a) - γf(s', a')) - 0.5 w(s, a)^2] - (1-γ)E_{s_0,a_0∼π}[f(s_0, a_0)].   (5)

Similar to the original formulation, Equation (4), it can be shown that Equation (5) is minimized when w(s, a) is the desired density ratio.
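A minimal tabular sketch of this minimax problem, written over state occupancies on a made-up 3-state chain, illustrates that its optimum is the density ratio. Here the inner maximization over w has the closed form w = f - γPf, and f follows gradient descent:

```python
import numpy as np

# Tabular sketch of DualDICE's minimax objective (state-occupancy
# version, hypothetical 3-state chain). The inner max over w is solved
# in closed form, w = f - gamma * P f, alternated with descent on f.
gamma = 0.9
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
d0 = np.array([1.0, 0.0, 0.0])      # initial state distribution
dD = np.array([1/3, 1/3, 1/3])      # sampling distribution of the data set

B = np.eye(3) - gamma * P           # maps f to f - gamma * P f
f = np.zeros(3)
for _ in range(30000):
    w = B @ f                                   # closed-form inner max
    grad_f = B.T @ (dD * w) - (1 - gamma) * d0  # gradient of J(f, w) in f
    f -= 0.25 * grad_f

# The learned w should match the true density ratio d_pi / d_D.
d_pi = (1 - gamma) * np.linalg.solve(B.T, d0)   # discounted occupancy of pi
assert np.allclose(B @ f, d_pi / dD, atol=1e-3)
```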

3. A REWARD FUNCTION PERSPECTIVE ON DISTRIBUTION CORRECTIONS

In this section, we present our behavior-agnostic approach to estimating MIS ratios, called the Successor Representation DIstribution Correction Estimation (SR-DICE). Our main insight is that MIS can be viewed as an optimization over the reward function, where the loss is uniquely optimized when the reward is the desired density ratio. We then apply our reward function perspective to a well-known MIS method, DualDICE (Nachum et al., 2019a), which enables us to observe difficulties in the optimization process and better understand related methods. All proofs for this section are deferred to Appendix A.

3.1. THE SUCCESSOR REPRESENTATION DICE

We will now derive our MIS approach. Our derivation shows that by treating MIS as reward function optimization, the desired density ratios can be obtained in a straightforward manner from the SR of the target policy. This pushes the challenging aspect of learning onto the computation of the SR, rather than onto optimizing the density ratio estimate. Furthermore, when tackling high-dimensional tasks, we can leverage deep RL approaches (Mnih et al., 2015; Kulkarni et al., 2016) to make learning the SR stable, giving rise to a practical MIS method. Our aim is to determine the MIS ratios d^π(s,a)/d^D(s,a), using only data sampled from the data set D and the policy π. This presents a challenge as we have direct access to neither d^π nor d^D. As a starting point, we begin by following the derivation of DualDICE (Nachum et al., 2019a). We first consider the convex function 1/2 mx^2 - nx, which is uniquely minimized by x* = n/m. Now by replacing x with r(s, a), m with d^D(s, a), and n with d^π(s, a), we have reformulated the convex function as the following objective:

min_{r(s,a) ∀(s,a)} J(r) := 1/2 E_{(s,a)∼d^D}[r(s, a)^2] - E_{(s,a)∼d^π}[r(s, a)].   (6)

While this objective is still impractical as it relies on expectations over both d^D and d^π, from Nachum et al. (2019a) we can state the following about Equation (6).

Observation 1 The objective J(r) is minimized when r(s, a) = d^π(s,a)/d^D(s,a), ∀(s, a).

Now we will diverge from the derivation of DualDICE. Note our choice of notation, r(s, a), in Equation (6). Describing the objective in terms of a fictitious reward r will allow us to draw on familiar relationships between rewards and value functions and build stronger intuition. Consider the equivalence between the value function over initial state-action pairs and the expectation of rewards over the state-action visitation of the policy: (1-γ)E_{s_0,a_0}[Q^π(s_0, a_0)] = E_{d^π}[r(s, a)].
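Observation 1 can be checked numerically: gradient descent on J(r), with made-up occupancy vectors standing in for d^D and d^π, drives r to the density ratio:

```python
import numpy as np

# Tabular check of Observation 1 over 4 hypothetical state-action pairs:
# minimizing 0.5 * E_dD[r^2] - E_dpi[r] recovers r = d_pi / d_D.
d_D = np.array([0.4, 0.3, 0.2, 0.1])
d_pi = np.array([0.1, 0.2, 0.3, 0.4])

r = np.zeros(4)
for _ in range(2000):
    grad = d_D * r - d_pi          # pointwise gradient of J(r)
    r -= 0.5 * grad

assert np.allclose(r, d_pi / d_D)
```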
It follows that the expectation over d^π in Equation (6) can be replaced with a value function Q̃^π over r:

min_{r(s,a) ∀(s,a)} J(r) := 1/2 E_{(s,a)∼d^D}[r(s, a)^2] - (1-γ)E_{s_0,a_0}[Q̃^π(s_0, a_0)].   (7)

Using (1-γ)E_{s_0,a_0}[Q̃^π(s_0, a_0)] = E_{d^π}[r(s, a)] provides a method for accessing the otherwise intractable d^π. This form of the objective is convenient because we can estimate the expectation over d^D by sampling from the data set and Q̃^π can be computed using any policy evaluation method. While we can estimate both terms in Equation (7) with relative ease, the optimization problem is not directly differentiable and would require re-learning the value function Q̃^π with every adjustment to the learned reward r. Fortunately, there exists a straightforward paradigm which enables direct reward function optimization, known as the successor representation (SR). Consider the relationship between the SR Ψ^π of the target policy π and its value function in the tabular setting: E_{s_0,a_0}[Q̃^π(s_0, a_0)] = E_{s_0}[Ṽ^π(s_0)] = E_{s_0}[Σ_s Ψ^π(s|s_0) E_{a∼π}[r(s, a)]]. It follows that we can create an optimization problem over the reward function r from Equation (7):

min_{r(s,a) ∀(s,a)} J_Ψ(r) := 1/2 E_{(s,a)∼d^D}[r(s, a)^2] - (1-γ)E_{s_0}[Σ_s Ψ^π(s|s_0) E_{a∼π}[r(s, a)]].   (8)

This objective can be generalized to continuous states by considering the deep SR ψ^π over features φ(s, a) and optimizing the weights of a linear function w. In this instance, the estimated density ratio r(s, a) is determined by w^⊤φ(s, a) and we can optimize w by minimizing the following:

min_w J(w) := 1/2 E_{d^D}[(w^⊤φ(s, a))^2] - (1-γ)E_{s_0,a_0∼π}[w^⊤ψ^π(s_0, a_0)].   (9)

Since this optimization problem is convex, it has a closed-form solution. Define D_0 as the set of start states contained in D. The unique optimizer of Equation (9) is:

w* = ((1/|D|) Σ_{(s,a)∈D} φ(s, a)φ(s, a)^⊤)^{-1} (1-γ)(1/|D_0|) Σ_{s_0∈D_0} Σ_{a_0} π(a_0|s_0) ψ^π(s_0, a_0).   (10)
However, we may generally prefer iterative, gradient-based solutions for scalability. We call the combination of learning the deep SR followed by optimizing Equation (9) the Successor Representation stationary DIstribution Correction Estimation (SR-DICE). SR-DICE is split into three learning phases: (1) learning the encoding φ, (2) learning the deep SR ψ^π, and (3) optimizing Equation (9). For the first two phases we follow standard practices from prior work (Kulkarni et al., 2016; Machado et al., 2018a), training the encoding φ via an encoder-decoder network to reconstruct the transition and training the deep SR ψ^π using TD learning-style methods. We summarize SR-DICE in Algorithm 1.

Algorithm 1 SR-DICE
1: Input: data set D, target policy π.
2: for t = 1 to T_1 do
3:   Train the encoder-decoder network on transitions sampled from D. # Encoding φ loss
4: for t = 1 to T_2 do
5:   min_{ψ^π} 1/2 (φ(s, a) + γψ'(s', a') - ψ^π(s, a))^2. # Deep successor representation ψ^π loss
6: for t = 1 to T_3 do
7:   min_w 1/2 (w^⊤φ(s, a))^2 - (1-γ) w^⊤ψ^π(s_0, a_0). # Density ratio w loss (Equation (9))
8: Output: R(π) estimate |D|^{-1} Σ_{(s,a,r)∈D} w^⊤φ(s, a) · r(s, a).

Additional implementation-level details can be found in Appendix D. Although it is difficult to make any guarantees on the accuracy of an approximate ψ^π trained with deep RL techniques, if we assume ψ^π is exact, then we can show that SR-DICE learns the least squares estimator of the desired density ratio.

Theorem 1 Assuming (1-γ)E_{s_0,a_0}[ψ^π(s_0, a_0)] = E_{(s,a)∼d^π}[φ(s, a)], then the optimizer w* of the objective J(w) is the least squares estimator of ∫_{S×A} (w^⊤φ(s, a) - d^π(s,a)/d^D(s,a))^2 d(s, a).

Hence, the main sources of error in SR-DICE are learning the encoding φ and the deep SR ψ^π. Notably, both of these steps are independent of the main optimization problem of learning w, as we have shifted the challenging aspects of density ratio estimation onto learning the deep SR. This leaves deep RL to do the heavy lifting.
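The three phases can be sketched end-to-end in a tabular setting. In the hypothetical 3-state MDP below, one-hot features mean phases 1 and 2 reduce to computing the tabular SR exactly, leaving only the convex loss on w:

```python
import numpy as np

# End-to-end tabular sketch of SR-DICE (hypothetical 3-state MDP with
# one-hot features, so the deep SR psi reduces to the tabular SR Psi).
gamma = 0.9
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 1.0, 2.0])       # rewards observed in the data set
d0 = np.array([1.0, 0.0, 0.0])      # initial state distribution
dD = np.array([0.5, 0.3, 0.2])      # sampling distribution of the data set

# Phases 1-2: with one-hot features, the SR is available in closed form.
Psi = np.linalg.inv(np.eye(3) - gamma * P)

# Phase 3: gradient descent on the convex loss over the linear weights w.
w = np.zeros(3)
for _ in range(2000):
    grad = dD * w - (1 - gamma) * Psi.T @ d0
    w -= 0.5 * grad

# w recovers the density ratio, and re-weighted rewards recover R(pi).
d_pi = (1 - gamma) * Psi.T @ d0
assert np.allclose(w, d_pi / dD)
R_est = np.sum(dD * w * r)                       # E_dD[w(s) r(s)]
R_true = (1 - gamma) * d0 @ np.linalg.solve(np.eye(3) - gamma * P, r)
assert np.isclose(R_est, R_true)
```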
The remaining optimization problem, Equation (9), only involves directly updating the weights of a linear function and, unlike many other MIS methods, requires no tricky minimax optimization. SR-DICE can also be applied to any pre-existing SR, or included in standard deep RL algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Hessel et al., 2017; Fujimoto et al., 2018) by treating the encoding φ as an auxiliary reward. This provides an alternative form of policy evaluation through MIS, or a method to access density ratios between the target policy and the sampling distribution, with possible applications to exploration, policy regularization, or unbiased off-policy gradients (Liu et al., 2019b; Nachum et al., 2019b; Touati et al., 2020).

3.2. REWARD FUNCTIONS & MIS: A CASE STUDY ON DUALDICE

One of the main attractions of MIS methods is that they use importance sampling ratios which re-weight individual transitions rather than entire trajectories. While independent of the length of trajectories collected by the behavior policy, we remark that the optimization problem is not independent of the implicit horizon defined by the discount factor γ, and MIS methods are still subject to the dynamics of the underlying MDP. In SR-DICE we explicitly handle the dynamics of the MDP by learning the SR with TD learning methods. In this case study, we examine a well-known MIS method, DualDICE (Nachum et al., 2019a), and discuss how it propagates updates through the MDP by considering its relationship to residual algorithms which minimize the mean squared Bellman error (Baird, 1995). By viewing other MIS methods through the lens of reward function optimization, we can understand their connection to value-based methods, shedding light on their optimization properties and challenges. Recall the underlying objective of DualDICE:

min_f J(f) := 1/2 E_{(s,a)∼d^D}[(f(s, a) - γE_{s',a'∼π}[f(s', a')])^2] - (1-γ)E_{s_0,a_0∼π}[f(s_0, a_0)].   (11)

By viewing the problem as reward function optimization, we can transform DualDICE into a more familiar format that considers rewards and value functions. To begin, we state the following theorem.

Theorem 2 Given an MDP (S, A, ·, p, d_0, γ), policy π, and function f : S × A → R, define the reward function r̃ : S × A → R where r̃(s, a) = f(s, a) - γE_{s',a'∼π}[f(s', a')]. Then it follows that the value function Q̃^π defined by the policy π, MDP, and reward function r̃, is the function f.

The proof follows naturally from the Bellman equation (Bellman, 1957). Informally, Theorem 2 states that any function f can be treated as an exact value function Q̃^π, for a carefully chosen reward function r̃(s, a) = f(s, a) - γE_{s',π}[f(s', a')]. Theorem 2 provides two perspectives on DualDICE.
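Theorem 2 is easy to verify numerically on a small made-up chain: any f, paired with the reward it induces, is exactly the value function of that reward:

```python
import numpy as np

# Numeric check of Theorem 2 (hypothetical 3-state chain): for any f,
# the reward r = f - gamma * P f has value function exactly f.
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
f = rng.standard_normal(3)           # arbitrary function over states

r = f - gamma * P @ f                # induced reward function
V = np.linalg.solve(np.eye(3) - gamma * P, r)   # value function of r
assert np.allclose(V, f)
```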
By replacing terms in Equation (11) with rewards and value functions, it can be viewed as the same objective as Equation (7) from SR-DICE:

min_{r(s,a) ∀(s,a)} J(r) := 1/2 E_{(s,a)∼d^D}[r(s, a)^2] - (1-γ)E_{s_0,a_0}[Q̃^π(s_0, a_0)].   (12)

The first insight from this relationship is that like SR-DICE, DualDICE can be viewed as reward function optimization and still requires some element of value learning. However, for DualDICE the form of the reward and value functions are unique. From Theorem 2, we remark that f(s_0, a_0) is always exactly Q̃^π(s_0, a_0) without additional computation. This occurs because f(s_0, a_0) is not a function of the reward; rather, the rewards are defined as a function of f. When the reward function is adjusted, f(s_0, a_0) may remain unchanged and other rewards are adjusted to compensate. To emphasize how DualDICE is subject to the properties of value learning, consider a second perspective on DualDICE taken from Theorem 2, where we replace f with Q̃^π:

min_{Q̃^π} J(Q̃^π) := 1/2 E_{(s,a)∼d^D}[(Q̃^π(s, a) - γE_{s',π}[Q̃^π(s', a')])^2] - (1-γ)E_{s_0,a_0}[Q̃^π(s_0, a_0)].   (13)

The first term is equivalent to Bellman residual minimization (Bellman, 1957; Baird, 1995), where the reward is 0 for all state-action pairs. The second term attempts to maximize only the initial value function Q̃^π(s_0, a_0). From a practical perspective this relationship is concerning, as the first term relies on successfully propagating updates throughout the MDP to balance out changes to the initial values, which may occur quickly. Consequently, in cases where DualDICE performs poorly, we may see the initial values approach infinity. To understand how this objective performs empirically, we measured the output of DualDICE on a basic OPE task with an identical behavior and target policy. In this case the true MIS ratio is 1.0 for all state-action pairs.
Consequently, both the fictitious reward E_{d^D}[f(s, a) - γE_{s',π}[f(s', a')]] and the normalized initial value function (1-γ)E_{s_0,a_0}[f(s_0, a_0)] should approach 1.0. In Figure 1, we graph both E_{d^D}[w(s, a)], where w(s, a) ≈ f(s, a) - γE_{s',π}[f(s', a')] is the ratio used by DualDICE (Equation (5)), and (1-γ)E_{s_0,a_0}[f(s_0, a_0)] output by DualDICE. While on the easier task, Pendulum, the performance looks reasonable, on HalfCheetah we can see that (1-γ)E_{s_0,a_0}[f(s_0, a_0)] greatly overestimates and E_{d^D}[w(s, a)] is highly unstable. This result is intuitive given the form of Equation (13), where the first term, which w(s, a) approximates, is pushed slowly towards 0 and the second term is pushed towards ∞. On the lower-dimensional problem, Pendulum, the objective is optimized more easily and both terms approach 1.0. On the harder problem, HalfCheetah, we can see how balancing residual learning, which is notoriously slow (Baird, 1995), with a maximization term on initial states creates a difficult optimization procedure. These results highlight the importance, and challenge, of propagating updates through the MDP. MIS methods are not fundamentally different from value-based methods, and viewing them as such may allow us to develop richer foundations for MIS.

4. RELATED WORK

Off-Policy Evaluation. Off-policy evaluation is a well-studied problem with several families of approaches. One family is based on importance sampling, which re-weights trajectories by the ratio of likelihoods under the target and behavior policies (Precup et al., 2001). Importance sampling methods are unbiased but suffer from variance which can grow exponentially with the length of trajectories (Li et al., 2015; Jiang & Li, 2016). Consequently, research has focused on variance reduction (Thomas & Brunskill, 2016; Munos et al., 2016; Farajtabar et al., 2018) or contextual bandits (Dudík et al., 2011; Wang et al., 2017). Marginalized importance sampling methods (Liu et al., 2018) aim to avoid this exponential variance by considering the ratio of stationary distributions, giving an estimator whose variance is polynomial with respect to the horizon (Xie et al., 2019; Liu et al., 2019a). Follow-up work has introduced a variety of approaches and improvements, allowing them to be behavior-agnostic (Nachum et al., 2019a; Uehara & Jiang, 2019; Mousavi et al., 2020) and to operate in the undiscounted setting (Zhang et al., 2020a; c). In a similar vein, some OPE methods rely on emphasizing, or re-weighting, updates based on their stationary distribution (Sutton et al., 2016; Mahmood et al., 2017; Hallak & Mannor, 2017; Gelada & Bellemare, 2019), or learning the stationary distribution directly (Wang et al., 2007; 2008).

Successor Representation. Introduced originally by Dayan (1993) as an approach for improving generalization in temporal-difference methods, successor representations (SR) were revived by recent work on deep successor RL (Kulkarni et al., 2016) and successor features (Barreto et al., 2017), which demonstrated that the SR could be generalized to the function approximation setting.
The SR has found applications in task transfer (Barreto et al., 2018; Grimm et al., 2019), navigation (Zhang et al., 2017; Zhu et al., 2017), and exploration (Machado et al., 2018a; Janz et al., 2019). It has also been used in a neuroscience context to model generalization and human reinforcement learning (Gershman et al., 2012; Momennejad et al., 2017; Gershman, 2018). The SR and our work also relate to state representation learning (Lesort et al., 2018) and general value functions (Sutton & Tanner, 2005; Sutton et al., 2011).

5. EXPERIMENTS

To evaluate our method, we perform several off-policy evaluation (OPE) experiments on a variety of domains. The aim is to evaluate the normalized average discounted reward E_{(s,a)∼d^π, r}[r(s, a)] of a target policy π. We benchmark our algorithm against two MIS methods, DualDICE (Nachum et al., 2019a) and GradientDICE (Zhang et al., 2020c), two deep RL approaches, and the true return of the behavior policy. The first deep RL method is a DQN-style approach (Mnih et al., 2015) where actions are selected by π (denoted Deep TD), and the second is the deep SR where the weight w is trained to minimize the MSE between w^⊤φ(s, a) and r(s, a) (denoted Direct-SR) (Kulkarni et al., 2016). Environment-specific experimental details are presented below, and complete algorithmic and hyper-parameter details are included in the supplementary material.

Continuous-Action Experiments. We evaluate the methods on a variety of MuJoCo environments (Brockman et al., 2016; Todorov et al., 2012). We examine two experimental settings. In both settings the target policy π and behavior policy π_b are stochastic versions of a deterministic policy π_d obtained from training the TD3 algorithm (Fujimoto et al., 2018). We evaluate a target policy π = π_d + N(0, σ^2), where σ = 0.1.

• For the "easy" setting, we gather a data set of 500k transitions using a behavior policy π_b = π_d + N(0, σ_b^2), where σ_b = 0.133. This setting roughly matches the experimental setting defined by Zhang et al. (2020a).
• For the "hard" setting, we gather a significantly smaller data set of 50k transitions using a behavior policy which acts randomly with p = 0.2 and uses π_d + N(0, σ_b^2), where σ_b = 0.2, with p = 0.8.

Unless specified otherwise, we use a discount factor of γ = 0.99 and all hyper-parameters are kept constant across environments. All experiments are performed over 10 seeds. We display the results of the "easy" setting in Figure 2 and the "hard" setting in Figure 3.
We remark that this setting can be considered easy as the behavior policy achieves a lower error, often outperforming all agents. SR-DICE significantly outperforms the other MIS methods on all environments, except for Humanoid, where GradientDICE achieves a comparable performance.

[Figure 3: Off-policy evaluation results on the continuous-action MuJoCo domain using the "hard" experimental setting (50k time steps, σ_b = 0.2, random actions with p = 0.2). The shaded area captures one standard deviation across 10 trials.]

This setting uses significantly fewer time steps than the "easy" setting and the behavior policy is a poor estimate of the target policy. Again, we see SR-DICE outperforms the MIS methods, demonstrating the benefits of our proposed decomposition and simpler optimization. This setting also shows the benefits of deep RL methods over MIS methods for OPE in high-dimensional domains, as Deep TD performs the strongest in every environment.

Atari Experiments. We also test each method on several Atari games (Bellemare et al., 2013), which are challenging due to their high-dimensional image-based state space. Standard preprocessing steps are applied (Castro et al., 2018) and sticky actions are used (Machado et al., 2018b) to increase difficulty and remove determinism. Each method is trained on a data set of one million time steps. The target policy is the deterministic greedy policy trained by Double DQN (Van Hasselt et al., 2016). The behavior policy is the ε-greedy policy with ε = 0.1. We use a discount factor of γ = 0.99. Experiments are performed over 3 seeds. Results are displayed in Figure 4. Additional experiments with different behavior policies can be found in the supplementary material.

Discussion. Across the board we find SR-DICE significantly outperforms the MIS methods. From the MSE graphs, we can see SR-DICE achieves much lower error in every task.
Looking at the estimated values of R(π) in the continuous-action environments (Figure 3), we can see that SR-DICE converges rapidly and maintains a stable estimate, while the MIS methods are particularly unstable, especially in the case of DualDICE. These observations are consistent in the Atari domain (Figure 4). Overall, we find the general trend in performance is Deep TD > SR-DICE = Direct-SR > MIS. Notably, Direct-SR and SR-DICE perform similarly in every task, suggesting that the limiting factor in SR-DICE is the quality of the deep successor representation.

Ablation. To study the robustness of SR-DICE relative to the competing methods, we perform an ablation study and investigate the effects of data set size, discount factor, and two different behavior policies. Unless specified otherwise, we use experimental settings matching the "hard" setting. We report the results in Figure 5. In the data set size experiment (a), SR-DICE performs well with as few as 5k transitions (5 trajectories). In some instances, the performance is unexpectedly improved with less data, although incrementally. For small data sets, the SR methods outperform Deep TD. One hypothesis is that the encoding acts as an auxiliary reward and helps stabilize learning in the low-data regime. In (b) we report the performance over changes in discount factor. The relative ordering across methods is unchanged. In (c) we use a behavior policy of π_d + N(0, σ_b^2), with σ_b = 0.5, a much larger standard deviation than either setting for continuous control. The results are similar to the original setting, with an increased bias on the deep RL methods. In (d) we use the underlying deterministic policy as both the behavior and target policy. The baseline MIS methods perform surprisingly poorly, once again demonstrating their weakness on harder domains.

6. CONCLUSION

In this paper, we introduce a method which can perform marginalized importance sampling (MIS) using the successor representation (SR) of the target policy. This is achieved by deriving an MIS formulation that can be viewed as reward function optimization. By using the SR, we effectively disentangle the dynamics of the environment from learning the reward function. This allows us to (a) use well-known deep RL methods to effectively learn the SR in challenging domains (Mnih et al., 2015; Kulkarni et al., 2016) and (b) provide a straightforward loss function to learn the density ratios, without any of the optimization tricks necessary for previous methods (Liu et al., 2018; Uehara & Jiang, 2019; Nachum et al., 2019a; Zhang et al., 2020c). This reward function interpretation also provides insight into prior MIS methods by showing how they are connected to value-based methods. Our resulting algorithm, SR-DICE, outperforms prior MIS methods in terms of both performance and stability and is the first MIS method which demonstrably scales to high-dimensional problems. As a secondary finding, our benchmarking shows that current MIS methods underperform more traditional value-based methods at OPE on high-dimensional tasks, suggesting that for practical applications, deep RL approaches should still be preferred. Regardless, outside of OPE there exists a wealth of possible applications for MIS ratios, from imitation (Kostrikov et al., 2019) to policy optimization (Imani et al., 2018; Liu et al., 2019b; Zhang et al., 2019) to mitigating distributional shift in offline RL (Fujimoto et al., 2019b; Kumar et al., 2019). For ease of use, our code is provided, and we hope our algorithm and insight will provide valuable contributions to the field.

A DETAILED PROOFS.

J(r) := 1/2 E_{(s,a)∼d^D}[r(s,a)^2] − E_{(s,a)∼d^π}[r(s,a)].

Proof. Take the partial derivative of J(r) with respect to r(s,a):

∂/∂r(s,a) [ 1/2 E_{(s,a)∼d^D}[r(s,a)^2] − E_{(s,a)∼d^π}[r(s,a)] ] = d^D(s,a) r(s,a) − d^π(s,a).

Then setting ∂J(r)/∂r(s,a) = 0, we have that J(r) is minimized when r(s,a) = d^π(s,a)/d^D(s,a) for all state-action pairs (s,a).

min_w J(w) := 1/2 E_{d^D}[(w^T φ(s,a))^2] − (1−γ) E_{s0,a0∼π}[w^T ψ^π(s0,a0)].

Proof. From our assumption we have (1−γ)E_{s0,a0∼π}[ψ^π(s0,a0)] = E_{(s,a)∼d^π}[φ(s,a)], so the objective can be rewritten as

min_w J(w) := 1/2 E_{d^D}[(w^T φ(s,a))^2] − E_{(s,a)∼d^π}[w^T φ(s,a)].

First note that the least squares estimator of Σ_{S×A} (w^T φ(s,a) − d^π(s,a)/d^D(s,a))^2 d^D(s,a) is ŵ = (Φ^T D_D Φ)^{-1} Φ^T D_D (d_π/d_D), where D_D = diag(d_D) and the division is element-wise. Now consider our optimization problem in matrix form:

J(w) = 1/2 (Φw)^T D_D (Φw) − d_π^T Φw = 1/2 w^T Φ^T D_D Φ w − d_π^T Φw. (19)

Now take the gradient of J(w) with respect to w and set it equal to 0:

Φ^T D_D Φ w − Φ^T d_π = 0, so Φ^T D_D Φ w = Φ^T D_D (d_π/d_D), and w = (Φ^T D_D Φ)^{-1} Φ^T D_D (d_π/d_D).

It follows that the optimizer of J(w) is the least squares estimator.

This result can also be obtained by considering state-action reward shaping (Wiewiora et al., 2003), treating f as the potential function.
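As a quick numerical sanity check of the least-squares claim, the following sketch (our own illustration, not code from the paper) verifies on a random problem instance that the minimizer of J(w) coincides with the d^D-weighted least-squares fit of w^T φ to the ratio d^π/d^D:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 12, 4                              # |S||A| = 12 state-action pairs, 4 features
Phi = rng.normal(size=(M, N))             # feature matrix, rows are phi(s, a)
d_D = rng.random(M); d_D /= d_D.sum()     # sampling distribution d^D
d_pi = rng.random(M); d_pi /= d_pi.sum()  # target occupancy d^pi
D = np.diag(d_D)

# Minimizer of J(w) = 0.5 * sum_sa d^D (w^T phi)^2 - sum_sa d^pi (w^T phi):
# setting the gradient Phi^T D Phi w - Phi^T d_pi to zero
w_J = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ d_pi)

# d^D-weighted least-squares fit of w^T phi to the element-wise ratio d^pi / d^D
ratio = d_pi / d_D
w_ls = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ ratio)

assert np.allclose(w_J, w_ls)
```

The two solutions agree because D (d_π/d_D) = d_π element-wise, which is exactly the identity used in the proof.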

B TABULAR SR-DICE

The experimental performance of SR-DICE, in particular its reliance on the SR, can be partially explained by examining its tabular counterpart. In fact, we can show that the value estimate derived from SR-DICE is exactly equal to a value estimate derived directly from the SR. Recall the form of SR-DICE's objective in a tabular setting, as a function of the SR Ψ^π:

min_r J_Ψ(r) := 1/2 E_{(s,a)∼d^D}[r(s,a)^2] − (1−γ) E_{s0}[ Σ_s Ψ^π(s|s0) E_{a∼π}[r(s,a)] ],

which has the following gradient:

∇_{r(s,a)} J_Ψ(r) = d^D(s,a) r(s,a) − (1−γ) Σ_{s0} p(s0) Ψ^π(s|s0) π(a|s).

We can compute this gradient from samples. Define D0 as the set of start states s0 ∈ D. It follows that

∇_{r(s,a)} J_Ψ(r) = (1/|D|) [ Σ_{(s',a')∈D} 1(s'=s, a'=a) ] r(s,a) − (1−γ)(1/|D0|) Σ_{s0∈D0} Ψ^π(s|s0) π(a|s).

Substituting the optimizer r̂ (Equation (24)) into the MIS estimate from Equation (25), then expanding and simplifying:

(1/|D|) Σ_{(s,a)∈D} [ (1−γ)(1/|D0|) Σ_{s0∈D0} Ψ^π(s|s0) π(a|s) · |D| / Σ_{(s',a')∈D} 1(s'=s, a'=a) ] r(s,a)  (26)

= (1−γ)(1/|D0|) Σ_{s0∈D0} Σ_{(s,a)∈D} Ψ^π(s|s0) π(a|s) [ 1 / Σ_{(s',a')∈D} 1(s'=s, a'=a) ] r(s,a)  (27)

= (1−γ)(1/|D0|) Σ_{s0∈D0} Σ_{(s,a)∈S×A} Ψ^π(s|s0) π(a|s) r(s,a).  (28)

Noting that Σ_{(s,a)∈S×A} Ψ^π(s|s0) π(a|s) r(s,a) = V^π(s0), we can see that SR-DICE returns the same solution as the SR solution for estimating R(π).
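The equivalence above can be checked numerically. The sketch below (our own construction, using a random 3-state MDP and arbitrary dataset counts, not the paper's code) computes the tabular optimizer r̂ of J_Ψ and confirms that the resulting SR-DICE estimate matches the direct SR value estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # dynamics p(s'|s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # target policy
r = rng.random((nS, nA))                                         # reward table

P_pi = np.einsum('sa,sap->sp', pi, P)           # state transitions under pi
Psi = np.linalg.inv(np.eye(nS) - gamma * P_pi)  # tabular SR: Psi[s0, s]

counts = rng.integers(1, 5, size=(nS, nA)).astype(float)  # dataset counts c(s, a)
n = counts.sum()
start_states = np.arange(nS)                    # D0: uniform over start states

# Optimizer of the tabular SR-DICE objective (Equation (24))
r_hat = np.zeros((nS, nA))
for s in range(nS):
    for a in range(nA):
        num = (1 - gamma) * Psi[start_states, s].mean() * pi[s, a]
        r_hat[s, a] = num * n / counts[s, a]

# SR-DICE estimate: dataset-weighted average of r_hat * reward
est_dice = (counts / n * r_hat * r).sum()

# Direct SR estimate: (1 - gamma) * E_{s0}[V^pi(s0)]
V = Psi @ (pi * r).sum(axis=1)
est_sr = (1 - gamma) * V[start_states].mean()

assert np.isclose(est_dice, est_sr)
```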

C ADDITIONAL EXPERIMENTS

In this section, we include additional experiments and visualizations, covering extra domains, additional ablation studies, run time experiments and additional behavior policies in the Atari domain.

C.1 EXTRA CONTINUOUS DOMAINS

Although our focus is on high-dimensional domains, the lower-dimensional environments Pendulum and Reacher have appeared in several related MIS papers (Nachum et al., 2019a; Zhang et al., 2020a). Therefore, we include results for these domains in Figure 6. All experimental settings match the experiments in the main body and are described fully in Appendix F.

C.2 REPRESENTATION LEARNING & MIS

SR-DICE relies on a disentangled representation learning phase in which an encoding φ is learned, followed by the deep successor representation ψ^π, which is used with a linear vector w to estimate the density ratios. In this section we perform experiments which evaluate the importance of representation learning by comparing its influence on the baseline MIS methods.

Alternate representations.

We examine both DualDICE (Nachum et al., 2019a) and GradientDICE (Zhang et al., 2020c) under four settings where we pass the representations φ and ψ^π to their networks, where both φ and ψ^π are learned in an identical fashion to SR-DICE. See Appendix E for specific details on the baselines. We report the results in Figure 7. For GradientDICE, no benefit is provided by varying the representations, although using the encoding φ matches the performance of vanilla GradientDICE regardless of the choice of network, providing some validation that φ is a reasonable encoding. Interestingly, for DualDICE, we see performance gains from using the SR ψ^π as a representation: slight when used as input, but significant when used with linear networks. On the other hand, as GradientDICE performs much worse with the SR, it is clear that the SR cannot be used as a representation without some degree of forethought. Increased capacity. As SR-DICE uses a linear function on top of a representation trained with the same capacity as the networks in DualDICE and GradientDICE, our next experiment examines whether this additional capacity benefits the baseline methods. To do so, we expand each network in both baselines by adding an additional hidden layer. The results are reported in Figure 8. We find there is a very slight decrease in performance when using the larger-capacity networks, suggesting the performance gap between SR-DICE and the baseline methods has little to do with model size. Figure 8: Off-policy evaluation results on HalfCheetah evaluating the performance benefits of larger network capacity for the baseline MIS methods. "Big" refers to the models with an additional hidden layer. The experimental setting corresponds to the "hard" setting from the main body. The shaded area captures one standard deviation across 10 trials. We find that there is no clear performance benefit from increasing network capacity.

C.3 TOY DOMAINS

We additionally test the MIS algorithms on a toy random-walk experiment with varying feature representations, based on a domain from Sutton et al. (2009). Domain. The domain is a simple 5-state MDP (x_1, x_2, x_3, x_4, x_5) with two actions (a_0, a_1), where action a_0 induces the transition x_i → x_{i−1} and action a_1 induces the transition x_i → x_{i+1}, with the state x_1 looping to itself under action a_0 and x_5 looping to itself under action a_1. Episodes begin in the state x_1. Target. We evaluate a policy π which selects actions uniformly, i.e. π(a_0|x_i) = π(a_1|x_i) = 0.5 for all states x_i. Our data set D contains all 10 possible state-action pairs and is sampled uniformly. We use a discount factor of γ = 0.99. Methods are evaluated on the average MSE between their estimate of d^π/d^D on all state-action pairs and the ground-truth value, which is calculated analytically. Hyper-parameters. Since we are mainly interested in a function approximation setting, each method uses a small neural network with two hidden layers of 32, followed by tanh activation functions. All networks use stochastic gradient descent with a learning rate α tuned for each method out of {1, 0.5, 0.1, 0.05, 0.01, 0.001}. This resulted in α = 0.05 for DualDICE, α = 0.1 for GradientDICE, and α = 0.05 for SR-DICE. Although there are only a small number of possible data points, we use a batch size of 128 to resemble the regular training procedure. As recommended by the authors, we use λ = 1 for GradientDICE (Zhang et al., 2020c), which was not tuned. For SR-DICE, we update the target network at every time step (τ = 1), which was not tuned. Since there are only 10 possible state-action pairs, we use the closed-form solution for the vector w (Equation (10)). Additionally, we skip the state representation phase of SR-DICE, instead learning the SR ψ^π over the given representation of each state, such that the encoding φ = x. This allows us to test SR-DICE on a variety of representations rather than using a learned encoding. Consequently, with these choices, SR-DICE has no pre-training phase, and therefore, unlike every other graph in this paper, we report the results as the SR is trained, rather than as the vector w is trained. Features. To test the robustness of each method, we examine three versions of the toy domain, each using a different feature representation over the same 5-state MDP. These feature sets are again taken from Sutton et al. (2009). • Tabular features: states are represented by a one-hot encoding, for example x_2 = [0, 1, 0, 0, 0]. • Inverted features: states are represented by the inverse of a one-hot encoding, for example x_2 = [1/2, 0, 1/2, 1/2, 1/2]. • Dependent features: states are represented by 3 features, which is not sufficient to represent all states exactly. In this case x_1 = [1, 0, 0], x_2 = [1/√2, 1/√2, 0], x_3 = [1/√3, 1/√3, 1/√3], x_4 = [0, 1/√2, 1/√2], x_5 = [0, 0, 1]. Since our experiments use neural networks rather than linear functions, this representation is mainly meant to test SR-DICE, where we skip the state representation phase and use the encoding φ = x, limiting the representation of the SR. Results. We report the results in Figure 9. We remark on several observations. SR-DICE learns significantly faster than the baseline methods, likely due to its use of temporal-difference methods in the SR update, rather than an update similar to residual learning, which is notoriously slow (Baird, 1995; Zhang et al., 2020b). GradientDICE appears to still be improving, although we limit training to 50k time steps, which we feel is sufficient given the domain is deterministic and has only 5 states. Notably, GradientDICE also uses a higher learning rate than SR-DICE and DualDICE.
We also find the final performance of SR-DICE is much better than that of DualDICE and GradientDICE in the domains where the feature representation is not particularly destructive, highlighting the easier optimization of SR-DICE. In the case of the dependent features, we find DualDICE outperforms SR-DICE after sufficient updates. However, we remark that this concern could likely be resolved by learning the features, and that SR-DICE still outperforms GradientDICE. Overall, we believe these results demonstrate that SR-DICE's strong empirical performance is consistent across simpler domains as well as the high-dimensional domains we examine in the main body.
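Since the ground-truth ratios in this domain can be calculated analytically, a short sketch may help make the setup concrete. The code below (our own illustration, not the paper's code) builds the 5-state random walk, computes the discounted state occupancy of the uniform policy starting from x_1, and forms the ratios d^π/d^D for the uniform data set:

```python
import numpy as np

nS, gamma = 5, 0.99
left = np.zeros((nS, nS)); right = np.zeros((nS, nS))
for i in range(nS):
    left[i, max(i - 1, 0)] = 1.0        # action a0: move left, x1 self-loops
    right[i, min(i + 1, nS - 1)] = 1.0  # action a1: move right, x5 self-loops

pi_prob = 0.5                           # uniform policy over the two actions
P_pi = pi_prob * left + pi_prob * right

d0 = np.zeros(nS); d0[0] = 1.0          # episodes begin in x1
# discounted state occupancy: mu = (1 - gamma) * d0^T (I - gamma * P_pi)^{-1}
mu = (1 - gamma) * d0 @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

d_pi = np.stack([pi_prob * mu, pi_prob * mu], axis=1)  # d^pi(s, a), shape (5, 2)
d_D = np.full((nS, 2), 1.0 / 10)        # D holds all 10 pairs uniformly
ratio = d_pi / d_D                      # ground-truth density ratios

assert np.isclose(d_pi.sum(), 1.0)
assert np.isclose((d_D * ratio).sum(), 1.0)  # ratios re-weight d^D back to d^pi
```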

C.4 RUN TIME EXPERIMENTS

In this section, we evaluate the run time of each algorithm used in our experiments. Although SR-DICE relies on pre-training the deep successor representation before learning the density ratios, we find each marginalized importance sampling (MIS) method uses a similar amount of compute, due to the reduced cost of training w after the pre-training phase. We evaluate the run time on the HalfCheetah environment in MuJoCo (Todorov et al., 2012) and OpenAI gym (Brockman et al., 2016) 

C.5 ATARI EXPERIMENTS

To better evaluate the algorithms in the Atari domain, we run two additional experiments in which we swap the behavior policy. We observe similar trends to the experiments in the main body of the paper. In both experiments we keep all other settings fixed. Notably, we continue to use the same target policy, corresponding to the greedy policy trained by Double DQN (Van Hasselt et al., 2016), the same discount factor γ = 0.99, and the same data set size of 1 million. Increased noise. In our first experiment, we increase the randomness of the behavior policy. As this can cause destructive behavior in the performance of the agent, we adopt an episode-dependent policy which selects between the noisy policy and the deterministic greedy policy at the beginning of each episode. This is motivated by the offline deep reinforcement learning experiments from Fujimoto et al. (2019a). As a result, we use an ε-greedy policy with p = 0.8 and the deterministic greedy policy (the target policy) with p = 0.2. ε is set to 0.2, rather than 0.1 as in the experiments in the main body of the paper. Results are reported in Figure 11. Figure 11: We plot the log MSE for off-policy evaluation in the image-based Atari domain, using an episode-dependent noisy policy, where ε = 0.2 with p = 0.8 and ε = 0 with p = 0.2. This episode-dependent selection ensures sufficient state coverage while using a stochastic policy. The shaded area captures one standard deviation across 3 trials. Markers are not placed at every point for visual clarity. We observe very similar trends to the original set of experiments. Again, we note DualDICE and GradientDICE perform very poorly, while SR-DICE, Direct-SR, and Deep TD achieve reasonable, but biased, performance. In this setting, we still find the behavior policy provides the closest estimate of the true value of R(π). Separate behavior policy. In this experiment, we use a behavior policy which is distinct from the target policy, rather than simply adding noise.
This behavior policy is derived from an agent trained with prioritized experience replay and Double DQN (Schaul et al., 2016). Again, we use an ε-greedy policy, with ε = 0.1. We report the results in Figure 12. Figure 12: We plot the log MSE for off-policy evaluation in the image-based Atari domain, using a distinct behavior policy, trained by a separate algorithm, from the target policy. This experiment tests the ability to generalize to a more off-policy setting. The shaded area captures one standard deviation across 3 trials. Markers are not placed at every point for visual clarity. Again, we observe similar trends in performance. Notably, in the Asterix game, the performance of Direct-SR surpasses the behavior policy, suggesting off-policy evaluation can outperform the naïve estimator in settings where the policy is sufficiently "off-policy" and distinct.
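The episode-dependent policy selection described above can be sketched as follows. Note that `make_episode_policy` and the `greedy_action` callable are hypothetical names of our own, not part of the released code:

```python
import random

def make_episode_policy(greedy_action, num_actions, p_noisy=0.8, eps=0.2, rng=random):
    """Sample once per episode: with probability p_noisy use an epsilon-greedy
    policy, otherwise use the deterministic greedy (target) policy throughout."""
    use_noise = rng.random() < p_noisy

    def act(state):
        if use_noise and rng.random() < eps:
            return rng.randrange(num_actions)   # random exploratory action
        return greedy_action(state)             # greedy w.r.t. the Q-network
    return act

# Usage sketch with a dummy greedy function standing in for the Double DQN policy:
policy = make_episode_policy(lambda s: 0, num_actions=4)
actions = [policy(None) for _ in range(10)]
assert all(0 <= a < 4 for a in actions)
```

Because the noisy/greedy choice is fixed at the start of each episode, roughly 20% of episodes are generated exactly by the target policy, which preserves coverage of on-policy states.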

D SR-DICE PRACTICAL DETAILS

In this section, we cover some basic implementation-level details of SR-DICE. Note that code is provided for additional clarity. SR-DICE uses two parametric networks, an encoder-decoder network to learn the encoding φ and a deep successor representation network ψ^π. Additionally, SR-DICE uses the weights of a linear function w. SR-DICE begins by pre-training the encoder-decoder network and the deep successor representation before applying updates to w. Encoder-Decoder. The encoder-decoder network encodes (s, a) to the feature vector φ(s, a), which is then decoded by several decoder heads. For the Atari domain, we choose to condition the feature vector only on states, φ(s), as the reward is generally independent of the action selection. For the Atari games, given a mini-batch transition (s, a, r, s'), the encoder-decoder network is trained to map the state s to the next state s' and reward r, while penalizing the size of φ(s). The resulting loss function is as follows:

min_{φ,D_s,D_r} L(φ, D) := λ_s (D_s(φ(s)) − s')^2 + λ_r (D_r(φ(s)) − r)^2 + λ_φ ‖φ(s)‖^2.

We use λ_s = 1, λ_r = 0.1 and λ_φ = 0.1. Deep Successor Representation. The deep successor representation ψ^π is trained to estimate the accumulation of φ. The training procedure resembles standard deep reinforcement learning algorithms. Given a mini-batch of transitions (s, a, r, s'), the network is trained to minimize the following loss:

min_{ψ^π} L(ψ^π) := (φ(s, a) + γψ'(s', a') − ψ^π(s, a))^2,

where ψ' is the target network. A target network is a frozen network used to stabilize the learning target (Mnih et al., 2015; Kulkarni et al., 2016). The target network is either updated to the current network, ψ' ← ψ^π, after a fixed number of time steps, or updated slowly at each time step (Lillicrap et al., 2015): ψ' ← τψ^π + (1 − τ)ψ'. Marginalized Importance Sampling Weights.
As described in the main body, we learn w by optimizing the following objective:

min_w J(w) := 1/2 E_{(s,a)∼d^D}[(w^T φ(s,a))^2] − (1−γ) E_{s0,a0∼π}[w^T ψ^π(s0,a0)].

This is achieved by sampling state-action pairs uniformly from the data set D, alongside a mini-batch of start states s0, which are recorded at the beginning of each episode during data collection. We summarize the learning procedure of SR-DICE in Algorithm 2.
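A minimal sketch of the w update may be helpful. The NumPy version below is our own illustration, using plain SGD rather than Adam and pre-computed feature mini-batches in place of network outputs; it applies the gradient of J(w) directly:

```python
import numpy as np

def update_w(w, phi_batch, psi0_batch, gamma=0.99, lr=3e-4):
    """One gradient step on J(w) = 0.5 E_dD[(w^T phi)^2]
    - (1 - gamma) E_s0[w^T psi(s0, a0)], from mini-batches of features."""
    # grad = E[(w^T phi) phi] - (1 - gamma) E[psi(s0, a0)]
    grad = ((phi_batch @ w)[:, None] * phi_batch).mean(axis=0)
    grad -= (1 - gamma) * psi0_batch.mean(axis=0)
    return w - lr * grad

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)
phi = rng.normal(size=(256, dim))    # phi(s, a) for (s, a) ~ d^D
psi0 = rng.normal(size=(32, dim))    # psi^pi(s0, a0) for sampled start states
for _ in range(100):
    w = update_w(w, phi, psi0)
assert w.shape == (dim,) and np.isfinite(w).all()
```

Since the loss is a simple quadratic in w, no adversarial inner optimization or target network is needed at this stage; this is the "straightforward loss" referred to in the conclusion.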

E BASELINES

In this section, we cover some of the practical details of each of the baseline methods. E.1 DUALDICE Dual stationary DIstribution Correction Estimation (DualDICE) (Nachum et al., 2019a) uses two networks f and w. The general optimization problem is defined as follows:

min_f max_w J(f, w) := E_{(s,a)∼d^D, a'∼π, s'}[ w(s,a)(f(s,a) − γf(s',a')) − 0.5 w(s,a)^2 ] − (1−γ) E_{s0,a0}[f(s0, a0)].

In practice, this corresponds to alternating single gradient updates to f and w. The authors suggest possible alternatives to the convex function 0.5 w(s,a)^2, such as (2/3)|w(s,a)|^{3/2}; however, in practice we found 0.5 w(s,a)^2 performed the best.

E.2 GRADIENTDICE

Gradient stationary DIstribution Correction Estimation (GradientDICE) (Zhang et al., 2020c) uses two networks f and w, and a scalar u. The general optimization problem is defined as follows:

min_w max_{f,u} J(w, u, f) := (1−γ) E_{s0,a0}[f(s0, a0)] + γ E_{(s,a)∼d^D, a'∼π, s'}[w(s,a) f(s',a')] − E_{(s,a)∼d^D}[w(s,a) f(s,a)] + λ( E_{(s,a)∼d^D}[u w(s,a) − u] − 0.5 u^2 ). (34)

Similarly to DualDICE, in practice this involves alternating single gradient updates to w, u, and f. As suggested by the authors, we use λ = 1.

E.3 DIRECT-SR

Direct-SR is a policy evaluation version of the deep successor representation (Kulkarni et al., 2016). The encoder-decoder network and deep successor representation are trained in the exact same manner as SR-DICE (see Section D). Then, rather than training w to learn the marginalized importance sampling ratios, w is trained to recover the original reward function. Given a mini-batch of transitions (s, a, r, s'), the following loss is applied: min_w L(w) := (r − w^T φ(s, a))^2. E.4 DEEP TD Deep TD, short for deep temporal-difference learning, takes the standard deep reinforcement learning methodology, akin to DQN (Mnih et al., 2015), and applies it to off-policy evaluation. Given a mini-batch of transitions (s, a, r, s'), the Q-network is updated by the following loss: min_{Q^π} L(Q^π) := (r + γQ'(s', a') − Q^π(s, a))^2, where a' is sampled from the target policy π(·|s'). Similarly to training the deep successor representation, Q' is a frozen target network which is updated to the current network after a fixed number of time steps, or incrementally at every time step.
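A deterministic sketch of the Deep TD procedure, using expected backups on a small random MDP in place of sampled mini-batches and a tabular Q in place of a network (our own simplification, not the paper's implementation), illustrates the role of the frozen target:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # dynamics
R = rng.random((nS, nA))
pi = np.full((nS, nA), 1.0 / nA)         # target policy: uniform

Q = np.zeros((nS, nA))
Q_target = Q.copy()                      # frozen copy of Q (the "target network")
for step in range(3000):
    v_next = (pi * Q_target).sum(axis=1)             # E_{a'~pi}[Q_target(s', a')]
    backup = R + gamma * np.einsum('saz,z->sa', P, v_next)
    Q += 0.5 * (backup - Q)                          # gradient step on the TD loss
    if step % 10 == 0:
        Q_target = Q.copy()                          # periodic target update

# Closed-form policy evaluation for comparison
M = (P[:, :, :, None] * pi[None, None]).reshape(nS * nA, nS * nA)
Q_exact = np.linalg.solve(np.eye(nS * nA) - gamma * M, R.reshape(-1)).reshape(nS, nA)
assert np.abs(Q - Q_exact).max() < 1e-6
```

The frozen target keeps the regression labels fixed between updates, which is the same stabilization device used for training ψ^π in SR-DICE.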

F EXPERIMENTAL DETAILS

All networks are trained with PyTorch (version 1.4.0) (Paszke et al., 2019). Any unspecified hyper-parameter uses the PyTorch default setting. There are three decoders for reward, action, and next state, respectively. For the action decoder and next state decoder we use a network with one hidden layer of 256. The reward decoder is a linear function of the encoding, without biases. All hidden layers are followed by ReLU activation functions. Network hyper-parameters. All networks are trained with the Adam optimizer (Kingma & Ba, 2014). We use a learning rate of 3e-4, again based on TD3, for all networks except for GradientDICE, which we found required careful tuning to achieve reasonable performance. For GradientDICE we found a learning rate of 1e-5 for f and w, and 1e-2 for u, achieved the highest performance. For DualDICE we chose the best-performing learning rate out of {1e-2, 1e-3, 3e-4, 5e-5, 1e-5}. SR-DICE, Direct-SR, and Deep TD were not tuned and use default hyper-parameters from deep RL algorithms. For training ψ^π and Q^π, the deep reinforcement learning aspects of SR-DICE, Direct-SR, and Deep TD, we use a mini-batch size of 256 and update the target networks using τ = 0.005, again based on TD3. For all MIS methods, we use a mini-batch size of 2048 as described by Nachum et al. (2019a). We found SR-DICE and DualDICE succeeded with smaller mini-batch sizes but did not test this in detail. All hyper-parameters are described in Table 2. Visualizations. We graph the log MSE between the estimate of R(π) and the true R(π), where the log MSE is computed as log(0.5(X − R(π))^2). We smooth the learning curves over a uniform window of 10. Agents were evaluated every 1k time steps and performance is measured over 250k time steps total. Markers are displayed every 25k time steps with offset for visual clarity.

F.2 ATARI

We interface with Atari via OpenAI gym (version 0.17.2) (Brockman et al., 2016); all agents use the NoFrameskip-v0 environments that include sticky actions with p = 0.25 (Machado et al., 2018b). Pre-processing. We use standard pre-processing steps based on Machado et al. (2018b) and Castro et al. (2018). We base our description on Fujimoto et al. (2019a), which our code is closely based on. We define the following:
• Frame: output from the Arcade Learning Environment.
• State: conventional notion of a state in an MDP.
• Input: input to the network.
The standard pre-processing steps are as follows:
• Frame: gray-scaled and reduced to 84 × 84 pixels, a tensor with shape (1, 84, 84).
• State: the maximum pixel value over the 2 most recent frames, a tensor with shape (1, 84, 84).
• Input: concatenation of the previous 4 states, a tensor with shape (4, 84, 84).
The notion of time steps is applied to states, rather than frames, and functionally, the concept of frames can be abstracted away once pre-processing has been applied to the environment. The agent receives a state every 4th frame and selects one action, which is repeated for the following 4 frames. If the environment terminates within these 4 frames, the state received will be the last 2 frames before termination. For the first 3 time steps of an episode, the input, which considers the previous 4 states, sets the non-existent states to all 0s.
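The state and input construction above can be sketched as follows; `InputStacker` and `make_state` are hypothetical helpers of our own, not part of the released code:

```python
import numpy as np
from collections import deque

def make_state(frame1, frame2):
    """State = element-wise max over the 2 most recent gray-scaled frames."""
    return np.maximum(frame1, frame2)          # shape (1, 84, 84)

class InputStacker:
    """Concatenate the previous 4 states into the (4, 84, 84) network input,
    using all-zero states for the first 3 time steps of an episode."""
    def __init__(self):
        self.states = deque([np.zeros((1, 84, 84))] * 4, maxlen=4)

    def push(self, state):
        self.states.append(state)              # oldest state is dropped
        return np.concatenate(list(self.states), axis=0)  # (4, 84, 84)

stacker = InputStacker()
frame = np.ones((1, 84, 84))
inp = stacker.push(make_state(frame, 0.5 * frame))
assert inp.shape == (4, 84, 84)
assert inp[:3].sum() == 0 and inp[3].max() == 1.0  # first 3 slots zero-padded
```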



Importance Sampling. Marginalized importance sampling (MIS) is a family of importance sampling approaches for off-policy evaluation in which the performance R(π) is evaluated by re-weighting rewards sampled from a data set D = {(s, a, r, s')} ∼ p(s'|s, a) d^D(s, a), where d^D is an arbitrary distribution, typically, but not necessarily, induced by some behavior policy. It follows that R(π) can be computed with importance sampling weights d^π(s,a)/d^D(s,a) on the rewards:

R(π) = E_{(s,a)∼d^D, r}[ (d^π(s, a)/d^D(s, a)) r(s, a) ].
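This re-weighting reduces to a simple sample average once ratio estimates are available; a minimal sketch (our own, assuming per-sample density-ratio estimates have already been computed):

```python
import numpy as np

def mis_estimate(rewards, ratios):
    """Estimate R(pi) = E_{(s,a)~d^D}[(d^pi / d^D) * r] from samples, given
    per-sample density-ratio estimates w(s, a) ~ d^pi(s, a) / d^D(s, a)."""
    return np.mean(np.asarray(ratios) * np.asarray(rewards))

# Sanity check: if the behavior equals the target policy, all ratios are 1
# and the estimate reduces to the empirical average reward.
rewards = np.array([1.0, 0.0, 2.0, 1.0])
assert mis_estimate(rewards, np.ones(4)) == rewards.mean()
```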

Figure 1: We plot the average values of E_{d^D}[w(s, a)] and (1 − γ)E_{s0,a0}[f(s0, a0)] output by DualDICE on a task with an identical behavior and target policy, such that the true value of both terms is 1. The 10 individual trials are plotted lightly, with the mean in bold. The estimates of DualDICE match our hypothesis that DualDICE overestimates f(s0, a0), as propagating updates through the MDP occurs at a much slower rate.

Figure 2: Off-policy evaluation results on the continuous-action MuJoCo domains using the "easy" experimental setting (500k time steps and σ_b = 0.133). The shaded area captures one standard deviation across 10 trials. We remark that this setting can be considered easy as the behavior policy achieves a lower error, often outperforming all agents. SR-DICE significantly outperforms the other MIS methods on all environments, except for Humanoid, where GradientDICE achieves comparable performance.

Figure 4: We plot the log MSE for off-policy evaluation in the image-based Atari domain. The shaded area captures one standard deviation across 3 trials. We can see the MIS baselines diverge on this challenging environment, while the remaining methods perform similarly. Perhaps surprisingly, on most games, the naïve baseline of using R(π_b) from the behavior policy outperforms all methods by a fairly significant margin. Although the estimates from deep RL methods are stable, they are biased, resulting in a higher MSE.

The objective J(r) is minimized when r(s, a) = d^π(s,a)/d^D(s,a), ∀(s, a).

min_r J(r) := 1/2 E_{(s,a)∼d^D}[r(s, a)^2] − (1−γ) E_{s0,a0∼π}[Q^π(s0, a0)]. (14)

Assuming (1−γ)E_{s0,a0}[ψ^π(s0, a0)] = E_{(s,a)∼d^π}[φ(s, a)], the optimizer w* of the objective J(w) is the least squares estimator of Σ_{S×A} (w^T φ(s, a) − d^π(s,a)/d^D(s,a))^2 d^D(s, a).

Let M = |S| × |A| and let N be the feature dimension. Let φ(s, a) be an N × 1 feature vector and Φ the M × N matrix where each row corresponds to a φ(s, a) vector. Let w be an N × 1 vector of parameters. Let d_π and d_D be M × 1 vectors of the values of d^π(s, a) and d^D(s, a) for all (s, a).

Given an MDP (S, A, ·, p, d_0, γ), a policy π, and a function f : S × A → R, define the reward function r : S × A → R as r(s, a) = f(s, a) − γE_{s',a'∼π}[f(s', a')]. Then the value function Q^π defined by the policy π, the MDP, and the reward function r is the function f. Proof. Define r(s, a) = f(s, a) − γE_π[f(s', a')]. Then for all state-action pairs (s, a) we have r(s, a) + γE_π[f(s', a')] = f(s, a) − γE_π[f(s', a')] + γE_π[f(s', a')] = f(s, a), satisfying the Bellman equation. It follows that f = Q^π by the uniqueness of the value function (Bertsekas & Tsitsiklis, 1996).
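This result is easy to verify numerically: for a random MDP, policy, and arbitrary f, defining r as above and solving the Bellman equation exactly recovers Q^π = f. The following sketch is our own check, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.95
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # dynamics
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # policy
f = rng.normal(size=(nS, nA))              # arbitrary function f(s, a)

# r(s, a) = f(s, a) - gamma * E_{s'~p, a'~pi}[f(s', a')]
fbar = (pi * f).sum(axis=1)                # E_{a'~pi}[f(s', a')]
r = f - gamma * np.einsum('saz,z->sa', P, fbar)

# Solve the Bellman equation Q = r + gamma * P_pi Q exactly
M = (P[:, :, :, None] * pi[None, None]).reshape(nS * nA, nS * nA)
Q = np.linalg.solve(np.eye(nS * nA) - gamma * M, r.reshape(-1)).reshape(nS, nA)

assert np.allclose(Q, f)                   # the value function of r is exactly f
```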

Setting the gradient to 0 and solving for r(s, a), we have the optimizer of J_Ψ(r):

r̂(s, a) = (1 − γ)(1/|D0|) Σ_{s0∈D0} Ψ^π(s|s0) π(a|s) · |D| / Σ_{(s',a')∈D} 1(s' = s, a' = a). (24)

Now consider the MIS equation for estimating the objective R(π) = (1 − γ)E_{s0}[V^π(s0)], where r̂ is an estimate of d^π(s,a)/d^D(s,a). Assume every state-action pair (s, a) is contained at least once in D; although the result holds regardless, this assumption allows us to avoid some cumbersome details. Replace d^π(s,a)/d^D(s,a) with r̂ in Equation (25).

Figure 6: Off-policy evaluation results for Pendulum and Reacher. The shaded area captures one standard deviation across 10 trials. Even on these easier environments, we find that SR-DICE outperforms the baseline MIS methods.

(1) Input encoding φ: f(φ(s, a)), w(φ(s, a)). (2) Input SR ψ^π: f(ψ^π(s, a)), w(ψ^π(s, a)). (3) Input encoding φ, linear networks: f^T φ(s, a), w^T φ(s, a). (4) Input SR ψ^π, linear networks: f^T ψ^π(s, a), w^T ψ^π(s, a).

Figure 7: Off-policy evaluation results on HalfCheetah examining the value of differing representations added to the baseline MIS methods. The experimental setting corresponds to the "hard" setting from the main body. The shaded area captures one standard deviation across 10 trials. We see that using the SR ψ^π as a representation improves the performance of DualDICE. On the other hand, GradientDICE performs much worse when using the SR, suggesting it cannot be used naively to improve MIS methods.


Figure 9: Results measuring the log MSE between the estimated density ratio and the ground-truth on a simple 5-state MDP domain with three feature sets. The shaded area captures one standard deviation across 10 trials. Results are evaluated every 100 time steps over 50k time steps total.

Figure 10: The average run time of each off-policy evaluation approach in minutes. Each experiment is run for 250k time steps and is averaged over 3 seeds. SR-DICE and Direct-SR pre-train the encoder-decoder for 30k time steps and the deep successor representation for 100k time steps.

both SR-DICE and Direct-SR. Most design decisions are inspired by prior work (Machado et al., 2017; 2018a). For continuous control, given a mini-batch transition (s, a, r, s'), the encoder-decoder network is trained to map the state-action pair (s, a) to the next state s', the action a, and the reward r. The resulting loss function is as follows:

min_{φ,D_s,D_a,D_r} L(φ, D) := λ_s (D_s(φ(s, a)) − s')^2 + λ_a (D_a(φ(s, a)) − a)^2 + λ_r (D_r(φ(s, a)) − r)^2. (29)

We use λ_s = 1, λ_a = 1 and λ_r = 0.1.

1: At each time step, sample a mini-batch of N transitions (s, a, r, s') and start states s0 from D. 2: for t = 1 to T1 do

As in the main set of experiments, each method is trained for 250k time steps. Additionally, SR-DICE and Direct-SR train the encoder-decoder for 30k time steps and the deep successor representation for 100k time steps before training w. Run time is averaged over 3 seeds. All time-based experiments are run on a single GeForce GTX 1080 GPU and an Intel Core i7-6700K CPU. Results are reported in Figure 10.

Continuous-action environment training hyper-parameters.

Training hyper-parameters for the Atari domain.


Evaluation. The marginalized importance sampling methods are measured by the average weighted reward over transitions sampled from a replay buffer, (1/N) Σ_{(s,a,r)} w(s, a) r(s, a), with N = 10k, while the deep RL methods use (1 − γ)(1/M) Σ_{s0} Q(s0, a0) with a0 ∼ π(·|s0), where M is the number of episodes. Each OPE method is trained on data collected by some behavior policy π_b. We estimate the "true" normalized average discounted reward of the target and behavior policies from 100 roll-outs in the environment. F.1 CONTINUOUS-ACTION ENVIRONMENTS Our agents are evaluated via tasks interfaced through OpenAI gym (version 0.17.2) (Brockman et al., 2016), which mainly rely on the MuJoCo simulator (mujoco-py version 1.50.1.68) (Todorov et al., 2012). We provide a description of each environment in Table 1. Experiments. Our experiments are framed as off-policy evaluation tasks in which agents aim to evaluate R(π) = E_{(s,a)∼d^π, r}[r(s, a)] for some target policy π. In each of our experiments, π corresponds to a noisy version of a policy trained by a TD3 agent (Fujimoto et al., 2018), a commonly used deep reinforcement learning algorithm. Denote π_d the deterministic policy trained by TD3 using the authors' GitHub https://github.com/sfujim/TD3. The target policy is defined as π_d + N(0, σ^2), where σ = 0.1. The off-policy evaluation algorithms are trained on a data set generated by a single behavior policy π_b. The experiments use two settings, "easy" and "hard", which vary the behavior policy and the size of the data set. All other settings are kept fixed. For the "easy" setting, the behavior policy is defined as π_d + N(0, σ_b^2) with σ_b = 0.133, and 500k time steps are collected (approximately 500 trajectories for most tasks). The "easy" setting is roughly based on the experimental setting from Zhang et al. (2020a). For the "hard" setting, the behavior policy adds increased noise and selects random actions with p = 0.2, and only 50k time steps are collected (approximately 50 trajectories for most tasks).
For Pendulum-v0 and Humanoid-v3, the range of actions is [−2, 2] and [−0.4, 0.4] respectively, rather than [−1, 1], so we scale the size of the noise added to actions accordingly. We set the discount factor to γ = 0.99. All continuous-action experiments are over 10 seeds. Pre-training. Both SR-DICE and Direct-SR rely on pre-training the encoder-decoder and the deep successor representation ψ. These networks were trained for 30k and 100k time steps respectively. As noted in Section C.4, even when including this pre-training step, both algorithms have a lower running time than DualDICE and GradientDICE. Architecture. For fair comparison, we use the same architecture for all algorithms except for DualDICE. This is a fully connected neural network with 2 hidden layers of 256 and ReLU activation functions. This architecture was based on the network defined in the TD3 GitHub and was not tuned. For DualDICE, we found tanh activation functions improved stability over ReLU. For SR-DICE and Direct-SR we use a separate architecture for the encoder-decoder network. The encoder is a network with a single hidden layer of 256, making each φ(s, a) a feature vector of 256 dimensions. An episode terminates after the game itself terminates, corresponding to multiple lives lost (which itself is game-dependent), or after 27k time steps (108k frames, or 30 minutes in real time). Rewards are clipped to be within the range [−1, 1]. Sticky actions are applied to the environment (Machado et al., 2018b), where the action a_t taken at time step t is set to the previously taken action a_{t−1} with p = 0.25, regardless of the action selected by the agent. Note this replacement is abstracted away from the agent and data set. In other words, if the agent selects action a at state s, the transition stored will contain (s, a), regardless of whether a is replaced by the previously taken action. Experiments.
For the main experiments we use a behavior and target policy derived from a Double DQN agent (Van Hasselt et al., 2016), a commonly used deep reinforcement learning algorithm. The behavior policy is an ε-greedy policy with ε = 0.1 and the target policy is the greedy policy (i.e., ε = 0). In Section C.5 we perform two additional experiments with a different behavior policy. Otherwise, all hyper-parameters are fixed across experiments. For each, the data set contains 1 million transitions and uses a discount factor of γ = 0.99. Each experiment is evaluated over 3 seeds. Pre-training. Both SR-DICE and Direct-SR rely on pre-training the encoder-decoder and the deep successor representation ψ. Similar to the continuous-action tasks, these networks were trained for 30k and 100k time steps respectively. Architecture. We use the same architecture as most value-based deep reinforcement learning algorithms for Atari, e.g. (Mnih et al., 2015; Van Hasselt et al., 2016; Schaul et al., 2016). This architecture is used for all networks other than the encoder-decoder network, for fair comparison, and was not tuned in any way. The network has a 3-layer convolutional neural network (CNN) followed by a fully connected network with a single hidden layer. As mentioned in pre-processing, the input to the network is a tensor with shape (4, 84, 84). The first layer of the CNN has a kernel depth of 32, kernel size 8 × 8, and stride 4. The second layer has a kernel depth of 32, kernel size 4 × 4, and stride 2. The third layer has a kernel depth of 64, kernel size 3 × 3, and stride 1. The output of the CNN is flattened to a vector of 3136 before being passed to the fully connected network. The fully connected network has a single hidden layer of 512. Each layer, other than the output layer, is followed by a ReLU activation function. The final layer of the network outputs |A| values, where |A| is the number of actions. The encoder-decoder used by SR-DICE and Direct-SR has a slightly different architecture.
The encoder is identical to the aforementioned architecture, except the final layer outputs the feature vector φ(s) with 256 dimensions and is followed by a ReLU activation function. The next-state decoder uses a single fully connected layer which transforms the 256-dimensional vector to 3136 dimensions, which is then passed through three transposed convolutional layers, each mirroring the CNN. Hence, the first layer has a kernel depth of 64, a kernel size of 3 × 3, and a stride of 1. The second layer has a kernel depth of 32, a kernel size of 4 × 4, and a stride of 2. The final layer has a kernel depth of 32, a kernel size of 8 × 8, and a stride of 4. This maps to a (1, 84, 84) tensor. All layers other than the final layer are followed by ReLU activation functions. Although the input uses a history of the four previous states, as mentioned in the pre-processing section, we only reconstruct the succeeding state without history. We do this because there is overlap between the history of the current input and the input corresponding to the next time step. The reward decoder is a linear function without biases.

Network hyper-parameters. Our hyper-parameter choices are based largely on the standard settings in (Castro et al., 2018). All networks are trained with the Adam optimizer (Kingma & Ba, 2014). We use a learning rate of 6.25e-5. Although not traditionally thought of as a hyper-parameter, in accordance with prior work, we modify the ε used by Adam to be 1.5e-4. For w we use a learning rate of 3e-4 with the default setting of ε = 1e-8. For u we use a learning rate of 1e-3. We use a mini-batch size of 32 for all networks. SR-DICE, Direct-SR, and Deep TD update the target network every 8k time steps. All hyper-parameters are described in Table 3.

Visualizations. We use identical visualizations to the continuous-action environments. Graphs display the log MSE between the estimate of R(π) and the true R(π) of the target policy, where the log MSE is computed as log 0.5(X - R(π))^2.
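Concretely, writing X for a scalar estimate of R(π), this metric can be computed as:

```python
import math

def log_mse(estimate, true_return):
    """log MSE between a scalar estimate X and the true R(pi),
    following log 0.5 * (X - R(pi))^2."""
    return math.log(0.5 * (estimate - true_return) ** 2)

print(log_mse(3.0, 1.0))  # -> log(0.5 * 4) = log(2)
```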
We smooth the learning curves over a uniform window of 10. Agents are evaluated every 1k time steps and performance is measured over 250k time steps in total. Markers are displayed every 25k time steps, with an offset for visual clarity.
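The uniform smoothing over a window of 10 can be sketched as a trailing moving average (a sketch; the exact edge handling used for the figures is an assumption):

```python
def smooth(values, window=10):
    """Uniform moving average over a trailing window,
    shrinking the window near the start of the curve."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```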

