BATCH REINFORCEMENT LEARNING THROUGH CONTINUATION METHOD

Abstract

Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging because: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift in data distribution causes high noise in value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL based on a global optimization technique known as the continuation method. By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint, our method 1) helps the agent escape local optima; 2) reduces the error in policy evaluation during the optimization procedure. We present results on a variety of control tasks, game environments, and a recommendation task to empirically demonstrate the efficacy of the proposed method.

1. INTRODUCTION

While RL is fundamentally an online learning paradigm, many practical applications of RL algorithms, e.g., recommender systems [5, 7] or autonomous driving [36], fall under the batch RL setup. Under this setting, the agent is asked to learn its policy from a fixed set of interactions collected by a different (and possibly unknown) policy, commonly referred to as the behavior policy, without the flexibility to gather new interactions. Recognizing that the interactive nature of online RL has been hindering its wider adoption, researchers have strived to bring these techniques offline [24, 11, 20, 23, 31, 12, 21, 2, 32, 8]. We focus on policy optimization under the batch RL setup. As pointed out in [3, 26], even with access to the exact gradient, the loss surface of the objective function maximizing the expected return is difficult to optimize, leading to slow convergence. Chen et al. [8] show that the expected-return objective exhibits sub-optimal plateaus and exponentially many local optima in the worst case. The batch setup makes learning even harder, as it adds large variance to the gradient estimate, especially when the learned policy differs from the behavior policy used to generate the fixed trajectories. Recent works propose to constrain the size of the policy update [27, 28] or the distance between the learned policy and the behavior policy [14, 21]. The strength of that constraint is a critical hyperparameter that can be hard to tune [28]: a loose constraint does not alleviate the distribution shift, while a strict one results in overly conservative updates. Here we propose to address these challenges using the continuation method [35, 6, 17]. Continuation methods attempt to solve a global optimization problem by progressively solving a sequence of new objectives that can be optimized more efficiently, and then tracing the solutions back to the original objective.
We change the objective function of policy optimization by adding a term penalizing the KL divergence between the parameterized policy $\pi_\theta$ and the behavior policy. We then gradually decrease the weight of that penalty, eventually converging to optimizing the expected return. With this additional constraint, we benefit from more accurate policy evaluation in the early stage of training, as the target policy is constrained to be close to the behavior policy. As training continues, we relax the constraint and allow for more aggressive improvement over the behavior policy, as long as policy evaluation remains stable and relatively reliable, i.e., has small enough variance. By doing so, the proposed method exhaustively exploits the information in the collected trajectories while avoiding the overestimation of state-action pairs that lack support. The contributions of this paper are as follows: (1) We propose a soft policy iteration approach to batch RL through the continuation method. (2) We theoretically verify that, in the tabular setting with exact gradients, maximizing the KL-regularized expected return leads to faster convergence than optimizing the expected return alone; moreover, our method converges to the globally optimal policy if there are sufficient data samples for accurate value estimation. (3) We demonstrate the effectiveness of our method in reducing errors in value estimation using visualization. (4) We empirically verify the advantages of our method over existing batch RL methods on various complex tasks.

2. RELATED WORK

Batch Reinforcement Learning. Off-policy reinforcement learning has been extensively studied [11, 20, 30, 23, 31], with many works [12, 21, 2] focusing on variants of Q-learning. Fujimoto et al. [12] and Kumar et al.
[21] investigated the extrapolation error in batch RL resulting from the mismatch between the state-action visitation distribution of the fixed dataset and that of the current policy, and proposed to address it by constraining the action distribution of the current policy from deviating too far from the training dataset distribution. Recent works [29, 33] studied policy iteration under the batch RL setup: the Q function is estimated in the policy evaluation step without special treatment, while the policy updates are regularized to remain close to the prior policy under a fixed constraint. To further reduce uncertainty in Q-learning, ensembles of Q networks [21, 29] and distributional Q-functions [2, 33] have been introduced for value estimation. [34, 18] use the KL divergence between the target policy and the behavior policy as a regularization term in the policy update and/or the value estimation; the constraint is controlled by a fixed weight on the KL regularization or a fixed threshold for the KL divergence. While all of these works apply a fixed constraint, determined by a sensitive hyperparameter, to control the distance between the behavior/prior policy and the target policy, we focus on gradually relaxed constraints. Constrained Policy Updates. Several works [27, 1, 15] studied constrained policy updates in online settings. Kakade & Langford [19] show that large policy updates can be destructive, and propose a conservative policy iteration algorithm to find an approximately optimal policy. Schulman et al. [27] constrain the KL divergence between the old policy and the new policy to guarantee policy improvement in each update. Other works force the policy to stay close to a learned prior distribution over actions, deriving a mutual-information regularization between state and action. Cheng et al. [9] propose to regularize in the function space. Again, these methods employ a fixed constraint, while we are interested in continually relaxing the constraint so as to eventually maximize the expected return.
Moreover, none of these methods have been extensively tested for batch RL with fixed training data. Continuation Method. The continuation method [35] is a global optimization technique. The main idea is to transform a nonlinear, highly non-convex objective function into a series of smoother objective functions that are easier to optimize. The optimization procedure is successively applied to new functions that are progressively more complex and closer to the original non-convex problem, so as to trace their solutions back to the original objective function. Chapelle et al. [6] use the continuation method to optimize the objective function of semi-supervised SVMs and reach lower test error compared with algorithms that directly minimize the original objective. Hale et al. [17] apply the continuation method to l1-regularized problems and demonstrate better performance on compressed sensing problems. Inspired by these prior works, we employ the continuation method to transform the objective of batch RL problems by adding regularization, and gradually decrease the regularization weight to trace the solution back to the original problem.
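To make the continuation idea concrete, here is a minimal, self-contained sketch (not from the paper) on a toy one-dimensional problem. A non-convex function is replaced by a family of Gaussian-smoothed surrogates, and gradient descent is warm-started along a schedule of decreasing smoothing widths; the objective, schedule, and hyperparameters are all illustrative choices.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.05, steps=500):
    """Plain gradient descent, returning the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy non-convex objective: f(x) = sin(5x) + 0.1 * (x - 3)^2.
# It has many local minima; the global minimum sits near x ~ 3.46.
f = lambda x: np.sin(5 * x) + 0.1 * (x - 3) ** 2

# Gaussian smoothing E_z[f(x + sigma * z)], z ~ N(0, 1), has a closed-form
# gradient for this f: the sin term is damped by exp(-12.5 * sigma^2), while
# the quadratic term's gradient is unchanged.
def smoothed_grad(x, sigma):
    return 5 * np.exp(-12.5 * sigma ** 2) * np.cos(5 * x) + 0.2 * (x - 3)

# Continuation: solve a sequence of progressively less-smoothed problems,
# warm-starting each from the previous solution (sigma = 0 is the original f).
x = 0.0
for sigma in [1.0, 0.5, 0.2, 0.0]:
    x = gradient_descent(lambda y, s=sigma: smoothed_grad(y, s), x)

# Baseline: descending the original objective directly from the same start
# point gets trapped in a poor local minimum near x ~ -0.29.
x_direct = gradient_descent(lambda y: smoothed_grad(y, 0.0), 0.0)
```

Here the continuation schedule traces the iterate into the global basin (f(x) ≈ -0.98), whereas direct descent stalls at a local minimum with f(x_direct) ≈ 0.09.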
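The relaxation schedule can likewise be illustrated in the simplest possible setting. Below is a hypothetical one-state (bandit) example, not taken from the paper: the KL-penalized improvement step has the closed form π(a) ∝ β(a) exp(Q(a)/τ), and annealing τ downward moves the learned policy from the behavior policy toward the greedy policy. The behavior policy, Q estimates, and τ schedule are made-up numbers for illustration.

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def improved_policy(q_values, beta, tau):
    """Closed-form maximizer of E_pi[Q] - tau * KL(pi || beta):
    pi(a) proportional to beta(a) * exp(Q(a) / tau)."""
    logits = np.log(beta) + q_values / tau
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical 3-action bandit: a batch was collected under beta, and Q
# estimates were fitted from that batch (values are illustrative).
beta = np.array([0.6, 0.3, 0.1])
q_hat = np.array([0.2, 0.5, 0.8])

# Continuation: relax the KL constraint by annealing tau downward.
divergences = []
for tau in [10.0, 1.0, 0.1]:
    pi = improved_policy(q_hat, beta, tau)
    divergences.append(kl(pi, beta))
```

A large τ keeps π close to β (tiny KL), which keeps batch value estimates reliable; as τ shrinks, π concentrates on the highest-value action and KL(π‖β) grows monotonically along the schedule.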

3. METHOD

In classical RL, an agent interacts with the environment while updating its policy. At each step $t$, the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$ according to its policy, receives a reward $r_t = r(s_t, a_t)$, where $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$. The state value of a policy $\pi$ at a state $s$ is $V^\pi(s) = \mathbb{E}_{s_0 = s,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\gamma \in [0, 1]$ is the discounting factor. At each step, the agent updates the policy $\pi$ so that the expected return $V^\pi(\rho) = \mathbb{E}_{s \sim \rho}[V^\pi(s)]$ (where $\rho$ is the initial state distribution) is maximized. In batch RL, the agent is not allowed to interact with the environment during policy learning. Instead, it has access to a fixed set of trajectories sampled from the environment according to a behavior policy $\beta$.¹ A trajectory $\{(s_0, a_0, r_0), (s_1, a_1, r_1), \cdots, (s_T, a_T, r_T)\}$ is generated by sampling $s_0$ from the initial state distribution $\rho$, sampling the action $a_t \sim \beta(\cdot|s_t)$ at the state $s_t$, and moving to $s_{t+1} \sim P(\cdot|s_t, a_t)$ at each step $t \in \{0, 1, \cdots, T\}$. The length $T$ can vary among trajectories. We then convert the generated trajectories into a dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, where $s'_i$ is the next state after $s_i$ in a trajectory. The goal of batch RL is to learn a parameterized policy $\pi_\theta$ from the provided dataset so as to maximize the expected return $V^{\pi_\theta}(\rho)$. In Sec. 3.1, we will first introduce a new objective function $\tilde{V}^{\pi,\tau}(\rho)$, i.e., the expected return of policy $\pi$ with a KL regularization term weighted by $\tau$. With exact gradients, $\tilde{V}^{\pi,\tau}(\rho)$ can be optimized more efficiently than the original objective $V^\pi(\rho)$. With the
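As a concrete illustration of the setup above, the following sketch converts behavior-policy trajectories of varying length into the transition dataset $D$ of $(s, a, r, s')$ tuples and computes a Monte-Carlo discounted return. The trajectory encoding (lists of $(s, a, r)$ triples, with `None` marking a terminal next state) is our own assumption, not the paper's code.

```python
def trajectories_to_dataset(trajectories):
    """Flatten trajectories [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)] into
    transition tuples (s, a, r, s'); the last step's next state is None,
    marking the end of the trajectory."""
    data = []
    for traj in trajectories:
        for t, (s, a, r) in enumerate(traj):
            s_next = traj[t + 1][0] if t + 1 < len(traj) else None
            data.append((s, a, r, s_next))
    return data

def discounted_return(rewards, gamma=0.9):
    """Monte-Carlo return sum_t gamma^t * r_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two toy trajectories of different lengths (states/actions are integers).
trajs = [[(0, 1, 1.0), (2, 0, 0.0), (3, 1, 1.0)],
         [(1, 0, 0.5), (2, 1, 1.0)]]
dataset = trajectories_to_dataset(trajs)   # 5 transitions in total
```
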



¹ If the behavior policy $\beta$ is not known in advance, it can be fitted from the data [30, 7].

