PRACTICAL MARGINALIZED IMPORTANCE SAMPLING WITH THE SUCCESSOR REPRESENTATION

Abstract

Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.

1. INTRODUCTION

Off-policy evaluation (OPE) is a reinforcement learning (RL) task where the aim is to measure the performance of a target policy from data collected by a separate behavior policy (Sutton & Barto, 1998). As it can often be difficult or costly to obtain new data, OPE offers an avenue for re-using previously stored data, making it an important challenge for applying RL to real-world domains (Zhao et al., 2009; Mandel et al., 2014; Swaminathan et al., 2017; Gauci et al., 2018). Marginalized importance sampling (MIS) (Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019a) is a family of OPE methods which re-weight sampled rewards by directly learning the density ratio between the state-action occupancy of the target policy and the sampling distribution. This approach can have significantly lower variance than traditional importance sampling methods (Precup et al., 2001), which consider a product of ratios over entire trajectories, and is amenable to deterministic policies and to behavior-agnostic settings where the sampling distribution is unknown. However, the body of MIS work is largely theoretical, and as a result, empirical evaluations of MIS have mostly been carried out on simple low-dimensional tasks, such as mountain car (state dim. of 2) or cartpole (state dim. of 4). In comparison, deep RL algorithms have shown successful behaviors in high-dimensional domains such as Humanoid locomotion (state dim. of 376) and Atari (image-based). In this paper, we present a straightforward approach for MIS that can be computed from the successor representation (SR) of the target policy. Our algorithm, Successor Representation DIstribution Correction Estimation (SR-DICE), is the first method that allows MIS to scale to high-dimensional systems, far outperforming previous approaches.
In comparison to previous algorithms, which rely on minimax optimization or kernel methods (Liu et al., 2018; Nachum et al., 2019a; Uehara & Jiang, 2019; Mousavi et al., 2020), SR-DICE requires only a simple convex loss applied to the linear function determining the reward, once the SR has been computed. As with the deep RL methods that learn in high-dimensional domains, the SR can be computed easily using behavior-agnostic temporal-difference (TD) methods. This makes our algorithm highly amenable to deep learning architectures and applicable to complex tasks. Our derivation of SR-DICE also reveals an interesting connection between MIS methods and value function learning. The key motivation for MIS methods is that, unlike traditional importance sampling methods, they can avoid variance with an exponential dependence on the horizon by re-weighting individual transitions rather than accumulating ratios along entire trajectories. We remark that while the MIS ratios only consider individual transitions, the optimization procedure is still subject to the dynamics of the underlying MDP. Subsequently, we use this insight to show a connection between a well-known MIS method, DualDICE (Nachum et al., 2019a), and Bellman residual minimization (Bellman, 1957; Baird, 1995), which helps explain some of the optimization properties and performance of DualDICE, as well as of other related MIS methods. We benchmark the performance of SR-DICE on several high-dimensional domains in MuJoCo (Todorov et al., 2012) and Atari (Bellemare et al., 2013), against several recent MIS methods. Our results demonstrate two key findings regarding high-dimensional tasks. (1) SR-DICE significantly outperforms the benchmark algorithms. We attribute this performance gap to SR-DICE's deep RL components, which outperform the MIS baselines in the same way that deep RL outperforms traditional methods on high-dimensional domains.
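The "simple convex loss on the linear reward function" can be illustrated with a minimal numpy sketch. Everything below is synthetic and hypothetical: random vectors stand in for the encoder outputs φ(s, a) and the SR outputs ψ(s₀, a₀), and the sketch shows only the general shape of the computation (a least-squares reward fit followed by SR-based evaluation), not the full SR-DICE objective, which additionally recovers density ratios.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: Phi holds encoder features phi(s, a) for sampled
# transitions; rewards are (noisily) linear in those features.
n_samples, n_features = 500, 8
Phi = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
rewards = Phi @ w_true + 0.01 * rng.normal(size=n_samples)

# Convex loss: ||Phi w - r||^2, solved here in closed form by least squares.
w, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)

# Given (assumed precomputed) SR outputs psi(s0, a0) at initial state-action
# pairs, a normalized-return estimate follows from the linear reward model:
#   R(pi) ~= (1 - gamma) * E[psi(s0, a0)] . w
gamma = 0.99
psi_s0 = rng.normal(size=(32, n_features))  # stand-in SR outputs
R_est = (1 - gamma) * psi_s0.mean(axis=0) @ w
```

The point of the sketch is that, once the SR is trained by TD, the remaining optimization is an ordinary convex regression rather than a minimax problem.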
Unfortunately, part of this performance gap is due to the fact that the baseline MIS methods scale poorly to challenging tasks. In Atari, we find that the baseline MIS methods exhibit unstable estimates, often reaching errors spanning many orders of magnitude. (2) MIS underperforms deep RL. Although SR-DICE achieves high performance, we find its errors are bounded by the quality of the SR. Consequently, SR-DICE and the standard SR achieve similar performance across all tasks. Worse still, we find that using a deep TD method comparable to DQN (Mnih et al., 2015) for policy evaluation outperforms both. Although the performance gap is minimal, for OPE there is no convincing argument for SR-DICE, or any current MIS method, as they introduce unnecessary complexity. However, this does not mean MIS is useless. We remark that the density ratios themselves are an independent objective, and have been used for applications such as policy regularization (Nachum et al., 2019b; Touati et al., 2020), imitation learning (Kostrikov et al., 2019), off-policy policy gradients (Imani et al., 2018; Liu et al., 2019b; Zhang et al., 2019), and non-uniform sampling (Sinha et al., 2020). SR-DICE serves as a stable, scalable approach for computing these ratios. We provide extensive experimental details in the supplementary material, and our code is made available.

2. BACKGROUND

Reinforcement Learning. RL is a framework for maximizing the accumulated reward of an agent interacting with its environment (Sutton & Barto, 1998). This problem is typically framed as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, R, p, d_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, dynamics model $p$, initial state distribution $d_0$ and discount factor $\gamma$. An agent selects actions according to a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. In this paper we address the off-policy evaluation (OPE) problem, where the aim is to measure the normalized expected per-step reward of the policy $R(\pi) = (1 - \gamma) \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$. An important notion in OPE is the value function $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, which measures the expected sum of discounted rewards when following $\pi$, starting from $(s, a)$. We define $d^\pi(s, a)$ as the discounted state-action occupancy, the probability of seeing $(s, a)$ under policy $\pi$ with discount $\gamma$:

$$d^\pi(s, a) = (1 - \gamma) \sum_{t=0}^\infty \gamma^t \int_{s_0} d_0(s_0) \, p^\pi(s_0 \to s, t) \, \pi(a|s) \, d(s_0),$$

where $p^\pi(s_0 \to s, t)$ is the probability of arriving at state $s$ after $t$ time steps when starting from an initial state $s_0$. This distribution is important as $R(\pi)$ equals the expected reward $r(s, a)$ under $d^\pi$: $R(\pi) = \mathbb{E}_{(s,a) \sim d^\pi, r}[r(s, a)]$. Successor Representation. The successor representation (SR) (Dayan, 1993) of a policy is a measure of occupancy of future states. It can be viewed as a general value function that learns a vector of the expected discounted visitation for each state. The successor representation $\Psi^\pi$ of a given policy $\pi$ is defined as $\Psi^\pi(s'|s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \mathbb{1}(s_t = s') \mid s_0 = s\right]$. Importantly, the value function can be recovered from the SR by summing over the expected reward of each state: $V^\pi(s) = \sum_{s'} \Psi^\pi(s'|s) \, \mathbb{E}_{a' \sim \pi}[r(s', a')]$.
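In the tabular case, the SR and the value-recovery identity above can be checked directly. The small chain MDP below is a made-up example; with a fixed policy, the induced state-to-state transition matrix $P$ gives the SR in closed form as $\Psi = \sum_t \gamma^t P^t = (I - \gamma P)^{-1}$.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi: P is the induced
# state-to-state transition matrix, r the expected per-state reward
# E_{a ~ pi}[r(s, a)].
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.99

# Successor representation: Psi[s, s'] is the expected discounted number
# of visits to s' when starting from s.
Psi = np.linalg.inv(np.eye(3) - gamma * P)

# Value recovery from the SR: V(s) = sum_{s'} Psi(s'|s) r(s').
V_from_sr = Psi @ r

# Sanity check against the direct Bellman solution V = (I - gamma P)^{-1} r.
V_direct = np.linalg.solve(np.eye(3) - gamma * P, r)
assert np.allclose(V_from_sr, V_direct)
```

Note that every diagonal entry of $\Psi$ is at least 1, since the starting state counts as a visit at $t = 0$.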
For infinite state and action spaces, the SR can instead be generalized to the expected occupancy over features, known as the deep SR (Kulkarni et al., 2016) or successor features (Barreto et al., 2017). For a given encoding function $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$, the deep SR $\psi^\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$ is defined as the expected discounted sum over features of the encoding function $\phi$ when starting from a given state-action pair and following $\pi$:

$$\psi^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right]. \quad (2)$$

