BATCH REINFORCEMENT LEARNING THROUGH CONTINUATION METHOD

Abstract

Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL using a global optimization technique known as the continuation method. By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint, our method 1) helps the agent escape local optima; 2) reduces the error in policy evaluation in the optimization procedure. We present results on a variety of control tasks, game environments and a recommendation task to empirically demonstrate the efficacy of our proposed method.

1. INTRODUCTION

While RL is fundamentally an online learning paradigm, many practical applications of RL algorithms, e.g., recommender systems [5, 7] or autonomous driving [36], fall under the batch RL setup. Under this setting, the agent is asked to learn its policy from a fixed set of interactions collected by a different (and possibly unknown) policy commonly referred to as the behavior policy, without the flexibility to gather new interactions. Recognizing that the interactive nature of online RL has been hindering its wider adoption, researchers have strived to bring these techniques offline [24, 11, 20, 23, 31, 12, 21, 2, 32, 8]. We focus on policy optimization under the batch RL setup. As pointed out in [3, 26], even with access to the exact gradient, the loss surface of the objective function maximizing the expected return is difficult to optimize, leading to slow convergence. Chen et al. [8] show that the objective function of expected return exhibits sub-optimal plateaus and exponentially many local optima in the worst case. The batch setup makes learning even harder as it adds large variance to the gradient estimate, especially when the learned policy differs from the behavior policy used to generate the fixed trajectories. Recent works propose to constrain the size of the policy update [27, 28] or the distance between the learned policy and the behavior policy [14, 21]. The strength of that constraint is a critical hyperparameter that can be hard to tune [28], as a loose constraint does not alleviate the distribution shift while a strict one results in conservative updates. Here we propose to address these challenges using continuation methods [35, 6, 17]. Continuation methods attempt to solve a global optimization problem by progressively solving a sequence of new objectives that can be optimized more efficiently and then tracing the solutions back to the original objective.
We change the objective function of policy optimization by including an additional term penalizing the KL divergence between the parameterized policy π_θ and the behavior policy. We then gradually decrease the weight of that penalty, eventually converging to optimizing the expected return. With this additional constraint, we benefit from more accurate policy evaluation in the early stage of training, as the target policy is constrained to be close to the behavior policy. As training continues, we relax the constraint and allow for more aggressive improvement over the behavior policy as long as the policy evaluation is still stable and relatively reliable, i.e., with a small enough variance. By doing so, the proposed method exhaustively exploits the information in the collected trajectories while avoiding the overestimation of state-action pairs that lack support. The contributions of this paper are as follows: (1) We propose a soft policy iteration approach to batch RL through the continuation method. (2) We theoretically verify that in the tabular setting with exact gradients, maximizing the KL-regularized expected return leads to faster convergence than optimizing the expected return alone; also, our method converges to the globally optimal policy if there are sufficient data samples for accurate value estimation. (3) We demonstrate the effectiveness of our method in reducing errors in value estimation using visualization. (4) We empirically verify the advantages of our method over existing batch RL methods on various complex tasks.

2. RELATED WORK

Batch Reinforcement Learning. Off-policy reinforcement learning has been extensively studied [11, 20, 30, 23, 31], with many works [12, 21, 2] focusing on variants of Q-learning. Fujimoto et al. [12] and Kumar et al.
[21] investigated the extrapolation error in batch RL resulting from the mismatch in state-action visitation distribution between the fixed dataset and the current policy, and proposed to address it by constraining the action distribution of the current policy from deviating far from the training dataset distribution. Recent works [29, 33] studied policy iteration under the batch RL setup. The Q function is estimated in the policy evaluation step without special treatment, while the policy updates are regularized to remain close to the prior policy under a fixed constraint. To further reduce uncertainty in Q-learning, an ensemble of Q networks [21, 29] and distributional Q-functions [2, 33] have been introduced for value estimation. [34, 18] use the KL divergence between the target policy and the behavior policy as a regularization term in the policy update and/or value estimation; the constraint is controlled by a fixed weight on the KL regularization or a fixed threshold for the KL divergence. While all of these works apply a fixed constraint, determined by a sensitive hyperparameter, to control the distance between the behavior/prior policy and the target policy, we focus on gradually relaxed constraints. Constrained Policy Updates. Several works [27, 1, 15] studied constrained policy updates in online settings. Kakade & Langford [19] show that large policy updates can be destructive, and propose a conservative policy iteration algorithm to find an approximately optimal policy. Schulman et al. [27] constrain the KL divergence between the old policy and the new policy to guarantee policy improvement in each update. Grau-Moya et al. [15] force the policy to stay close to a learned prior distribution over actions, deriving a mutual-information regularization between states and actions. Cheng et al. [9] propose to regularize in function space.
Again, these methods focus on a fixed constraint, while we are interested in continually relaxing the constraint to eventually maximize the expected return. Moreover, none of these methods has been extensively tested for batch RL with fixed training data. Continuation Method. The continuation method [35] is a global optimization technique. The main idea is to transform a nonlinear, highly non-convex objective function into a series of smoother, easier-to-optimize objective functions. The optimization procedure is successively applied to new functions that are progressively more complex and closer to the original non-convex problem, tracing their solutions back to the original objective function. Chapelle et al. [6] use the continuation method to optimize the objective function of semi-supervised SVMs and reach lower test error compared with algorithms directly minimizing the original objective. Hale et al. [17] apply the continuation method to l1-regularized problems and demonstrate better performance on compressed sensing problems. Inspired by these works, we employ the continuation method to transform the objective of batch RL problems by adding regularization, and gradually decrease the regularization weight to trace the solution back to the original problem.
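To make the idea concrete, here is a minimal sketch (ours, not from the paper) of a continuation scheme on a one-dimensional non-convex function. A quadratic penalty pulls the iterate toward an anchor point, playing the role the behavior policy plays in our method, and the penalty weight is then decayed, mirroring the decaying KL weight. The function, anchor, and schedule are all illustrative choices.

```python
def f(x):
    # Non-convex objective: local minimum near x ~ 1.13,
    # global minimum near x ~ -1.30.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, grad, lr=0.01, steps=2000):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def continuation_descent(x, anchor, tau0=10.0, decay=0.9,
                         inner_steps=50, rounds=100, lr=0.01):
    # Minimize f_tau(x) = f(x) + tau * (x - anchor)^2 for a
    # decaying sequence of tau, tracing the solution back to f.
    tau = tau0
    for _ in range(rounds):
        g = lambda y: grad_f(y) + 2 * tau * (y - anchor)
        x = gradient_descent(x, g, lr=lr, steps=inner_steps)
        tau *= decay
    return x

x_plain = gradient_descent(1.2, grad_f)          # stuck in the local basin
x_cont = continuation_descent(1.2, anchor=-1.0)  # traced to the global basin
```

Plain gradient descent from x = 1.2 settles in the local basin, while the continuation sequence ends near the global minimum, which is the same escape behavior the grid-world example in Sec. 3.1 illustrates for policies.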

3. METHOD

In classical RL, an agent interacts with the environment while updating its policy. At each step t, the agent observes a state s_t ∈ S, selects an action a_t ∈ A according to its policy, receives a reward r_t = r(s_t, a_t), where r : S × A → R, and transitions to the next state s_{t+1} ∼ P(·|s_t, a_t). The state value of a policy π at a state s is V^π(s) = E_{s_0=s, a_t∼π(·|s_t), s_{t+1}∼P(·|s_t,a_t)} [Σ_{t=0}^∞ γ^t r(s_t, a_t)], where γ ∈ [0, 1] is the discount factor. At each step, the agent updates the policy π so that the expected return V^π(ρ) = E_{s∼ρ}[V^π(s)] (where ρ is the initial state distribution) is maximized. In batch RL, the agent is not allowed to interact with the environment during policy learning. Instead, it has access to a fixed set of trajectories sampled from the environment according to a behavior policy β.¹ A trajectory {(s_0, a_0, r_0), (s_1, a_1, r_1), ..., (s_T, a_T, r_T)} is generated by sampling s_0 from the initial state distribution ρ, sampling the action a_t ∼ β(·|s_t) at the state s_t, and moving to s_{t+1} ∼ P(·|s_t, a_t) for each step t ∈ [0, 1, ..., T]. The length T can vary among trajectories. We then convert the generated trajectories to a dataset D = {(s_i, a_i, r_i, s'_i)}_{i=1}^N, where s'_i is the next state after s_i in a trajectory. The goal of batch RL is to learn a parameterized policy π_θ from the provided dataset to maximize the expected return V^{π_θ}(ρ). In Sec. 3.1, we first introduce a new objective function Ṽ^{π,τ}(ρ), i.e., the expected return of policy π with a KL regularization term weighted by τ. With exact gradients, Ṽ^{π,τ}(ρ) can be optimized more efficiently than the original objective V^π(ρ). With the continuation method, solving a sequence of optimization problems for Ṽ^{π,τ}(ρ) with decaying values of τ converges toward optimizing V^π(ρ) and makes the optimization easier. In Sec. 3.2, we derive soft policy iteration with KL regularization to optimize Ṽ^{π,τ}(ρ), without the assumption of exact gradients. Finally, in Sec. 3.3, we propose a practical batch RL algorithm with value estimation for the target policy based on this theory.
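The conversion from trajectories to the transition dataset D can be sketched as follows. The handling of the terminal step (storing None as the successor) is our own convention, since the text does not specify one.

```python
def trajectories_to_dataset(trajectories):
    """Flatten trajectories [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]
    into a dataset of transitions (s, a, r, s').

    The final transition of a trajectory has no successor state;
    we store None for it (an assumption, not specified in the text).
    """
    dataset = []
    for traj in trajectories:
        for t, (s, a, r) in enumerate(traj):
            s_next = traj[t + 1][0] if t + 1 < len(traj) else None
            dataset.append((s, a, r, s_next))
    return dataset
```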

3.1. OPTIMIZING EXPECTED RETURN WITH KL REGULARIZATION

In batch RL, the distribution of the trajectories generated by the behavior policy can be very different from that of the learned policy. We thus restrict the learned policy to stay close to the behavior policy via KL-divergence regularization. Define the soft state value of a policy π at a state s as

Ṽ^{π,τ}(s) = E_{s_0=s, a_t∼π(·|s_t), s_{t+1}∼P(·|s_t,a_t)} [ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) − τ log (π(a_t|s_t) / β(a_t|s_t)) ) ],    (1)

where the temperature parameter τ controls the deviation from β. The new objective function becomes Ṽ^{π,τ}(ρ) = E_{s∼ρ}[Ṽ^{π,τ}(s)]. This KL-regularized objective differs from the original objective V^π(ρ), which however is recovered as τ → 0. As pointed out in [3], even with exact gradients, the objective function V^π(ρ) is difficult to optimize due to its highly non-smooth landscape. Mei et al. [26] further prove that, in a tabular setting with a softmax parameterized policy and exact gradients, the vanilla policy gradient method (i.e., directly updating the parameters of policy π to maximize V^π(ρ) with gradient ascent) converges to the globally optimal policy at a rate of O(1/t), while entropy-regularized policy gradient enjoys a significantly faster linear convergence rate O(e^{−ct}). Motivated by this line of work, we investigate the convergence rate of optimizing Ṽ^{π,τ}(ρ) with exact gradient ascent and compare it with the vanilla policy gradient method. We study the smoothness and the Łojasiewicz inequality for the function Ṽ^{π,τ}(ρ) to prove the convergence rate, similar to [26]. The detailed proofs of all following theorems are provided in the appendix. Theorem 1.
In the tabular setting with a softmax parameterized policy π_θ, maximizing Ṽ^{π,τ}(ρ) using policy gradient with the learning rate η = (1−γ)³ / (8M + τ(4 + 8 log A)), for all t > 1 we have

Ṽ^{π*_τ,τ}(ρ) − Ṽ^{π_{θ_t},τ}(ρ) ≤ C · e^{−C_τ (t−1)} · (M + τ log A) / (1−γ)²,

where π*_τ is the optimal policy maximizing Ṽ^{π,τ}(ρ), M is the bound on the absolute value of r(s, a) + τ log β(a|s), A is the size of the action space, S is the size of the state space, C_τ ∝ (1−γ)⁴ / ((8M/τ + 4 + 8 log A) · S), and C is a constant independent of t and τ. Theorem 1 states that the KL-regularized expected return Ṽ^{π,τ}(ρ) can be optimized with a convergence rate of O(e^{−ct}) rather than the O(1/t) rate of vanilla policy gradient for the expected return alone. The faster convergence inspires us to optimize Ṽ^{π,τ}(ρ) to reach the policy π*_τ, then use π*_τ as initialization, gradually decrease the temperature τ towards 0, and eventually move from π*_τ to π* = argmax_π V^π(ρ). We construct a toy example to illustrate this motivation. In the grid world (Fig. 1a), the start state, annotated with 'S', is in the center and the terminal states are marked in yellow. There are only two states with positive rewards (0.9 and 1), and four actions {up, down, left, right}. A badly initialized policy π_{θ_0} is shown as arrows in Fig. 1a. The initialization results in a poor policy, with a high tendency to go right toward a terminal state with zero reward. The vanilla policy gradient method (i.e., maximizing V^π(ρ) with the true gradient) starting from this initial point takes more than 7000 iterations to escape a sub-optimal solution (Fig. 1b). In contrast, we escape the sub-optimal solution much faster when applying the continuation method to update the policy with the gradients of Ṽ^{π,τ}(ρ), where the behavior policy β(·|s) = [u_1, u_2, u_3, u_4], with u_i, i = 1, 2, 3, 4, randomly sampled from U[0, 1] and normalized for each state s. In Fig. 1b, as we decrease τ, the value of the learned policy π_{θ_i} at each iteration i quickly converges to the optimal value. In other words, optimizing a sequence of objective functions Ṽ^{π,τ}(ρ) reaches the optimal solution for V^π(ρ) significantly faster.
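The KL-regularized value Ṽ^{π,τ} can be computed exactly in a small tabular MDP by fixed-point iteration of the regularized Bellman equation. A minimal sketch (our own toy MDP, not the grid world from the paper):

```python
import numpy as np

def soft_value(P, R, pi, beta, tau, gamma=0.9, iters=500):
    """KL-regularized state values for a tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    pi, beta: (S, A) policies, tau: KL weight. Iterates
    V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * E_{s'} V(s')
                            - tau * log(pi(a|s)/beta(a|s)) ].
    """
    S, A = R.shape
    kl_term = np.log(pi / beta)       # elementwise log-ratio
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V         # (S, A) one-step lookahead
        V = np.sum(pi * (Q - tau * kl_term), axis=1)
    return V
```

Since the KL penalty is non-negative, the soft value with τ > 0 is never above the unregularized value of the same policy, and the two coincide when π = β, which makes for a quick sanity check.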

3.2. SOFT POLICY ITERATION WITH KL REGULARIZATION

As explained in the previous section, we focus on the new objective function Ṽ^{π,τ}(ρ), which can be optimized more efficiently, and use the continuation method to relax toward optimizing V^π(ρ). Batch RL adds the complexity of estimating the gradient of Ṽ^{π,τ}(ρ) with respect to π from a fixed set of trajectories. We propose to adapt soft actor-critic [16], a general algorithm for learning optimal maximum-entropy policies, to batch RL for our use case. We change the entropy regularization to KL regularization and derive soft policy iteration to learn the KL-regularized optimal policy. For a policy π and temperature τ, the soft state value is defined in Eq. 1 and the soft Q function is defined as

Q̃^{π,τ}(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)} [Ṽ^{π,τ}(s')].

In the soft policy evaluation step, we aim to compute the value of policy π according to the KL-regularized objective Ṽ^{π,τ}(ρ) = E_{s∼ρ}[Ṽ^{π,τ}(s)]. According to Lemma 1 in the Appendix, the soft Q value can be computed by repeatedly applying the soft Bellman backup operator

T^{π,τ} Q(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)} [V(s')], where V(s) = E_{a∼π(·|s)} [ Q(s, a) − τ log (π(a|s) / β(a|s)) ].

In the policy improvement step, we maximize the expected return based on the Q-value evaluation with the KL-divergence regularization. The following policy update is guaranteed to result in an improved policy in terms of its soft value (Lemma 2 in the Appendix):

π_new(·|s) = argmax_{π∈Π} [ E_{a∼π(·|s)} ( Q̃^{π_old,τ}(s, a) ) − τ KL(π(·|s) ‖ β(·|s)) ],

where KL(π(·|s) ‖ β(·|s)) = E_{a∼π(·|s)} [ log (π(a|s) / β(a|s)) ]. The soft policy iteration algorithm alternates between soft policy evaluation and soft policy improvement, and provably converges to the optimal policy maximizing the objective Ṽ^{π,τ}(ρ). Theorem 2. Repeated application of soft policy evaluation and soft policy improvement converges to a policy π*_τ such that Q̃^{π*_τ,τ}(s, a) ≥ Q̃^{π,τ}(s, a) for any π ∈ Π and (s, a) ∈ S × A.
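In the tabular case the two steps above can be sketched directly: evaluation iterates the regularized Bellman backup, and the improvement step has the closed form π_new(a|s) ∝ β(a|s) exp(Q(s, a)/τ). The MDP below is an illustrative toy, not the paper's function-approximation algorithm (that is Algorithm 1).

```python
import numpy as np

def soft_policy_iteration(P, R, beta, tau, gamma=0.9, sweeps=30, backups=200):
    """Tabular soft policy iteration for the KL-regularized objective.

    P: (S, A, S) transitions, R: (S, A) rewards, beta: (S, A) behavior
    policy, tau > 0. Alternates soft policy evaluation and the
    closed-form improvement pi_new(a|s) ∝ beta(a|s) * exp(Q(s,a)/tau).
    """
    S, A = R.shape
    pi = beta.copy()
    Q = np.zeros((S, A))
    for _ in range(sweeps):
        # Soft policy evaluation: iterate the backup T^{pi,tau}.
        for _ in range(backups):
            V = np.sum(pi * (Q - tau * np.log(pi / beta)), axis=1)
            Q = R + gamma * P @ V
        # Soft policy improvement (closed form, with a stability shift).
        logits = np.log(beta) + Q / tau
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, Q
```

The improvement guarantee (Lemma 2) can be checked numerically: the soft value of the returned policy should dominate the soft value of β itself.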
Soft policy iteration finds a policy π*_τ with the optimal soft Q value for each state-action pair and hence attains the optimal value of Ṽ^{π,τ}(ρ). We propose to use soft policy iteration to solve the objectives Ṽ^{π,τ}(ρ) with decreasing values of τ, moving back to the objective V^π(ρ) as τ → 0. The method is guaranteed to asymptotically converge to the optimal policy π* for the objective V^π(ρ). Theorem 3. Let π*_τ(a|s) be the optimal policy from soft policy iteration with fixed temperature τ. We have π*_τ(a|s) ∝ exp( Q̃^{π*_τ,τ}(s, a) / τ ) β(a|s). As τ → 0, π*_τ(a|s) takes the optimal action a* with the optimal Q value for state s.
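Theorem 3's closed form and its two limits can be checked numerically. A small sketch with made-up Q values and behavior probabilities: as τ → 0 the policy concentrates on the argmax action, and as τ → ∞ it falls back to β.

```python
import numpy as np

def kl_optimal_policy(Q, beta, tau):
    """pi*_tau(a|s) ∝ beta(a|s) * exp(Q(s,a)/tau)  (Theorem 3 form).

    Q, beta: (S, A) arrays. Computed in log space for stability.
    """
    logits = np.log(beta) + Q / tau
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```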

3.3. ERROR IN VALUE ESTIMATE

In the previous section, we showed that soft policy iteration with the continuation method provably converges to the globally optimal policy maximizing expected return. However, in batch RL with a fixed dataset and limited samples, we cannot perform soft policy iteration with KL regularization in its exact form. Specifically, in the policy evaluation step, when the learned policy π deviates from the behavior policy β and chooses a state-action pair (s, a) rarely visited by β, the estimation of the target r(s, a) + γ E_{s'∼P(·|s,a)}[V(s')] can be very noisy. The error in the value estimate Q(s, a) is then further propagated to other state-action pairs through the Bellman update. Finally, inaccurate value estimation causes errors in the policy improvement step, resulting in a worse policy. On the other hand, if we constrain the learned policy π to be very close to the behavior policy β, we can expect the policy evaluation to be reliable and safely update the learned policy. The tight constraint, however, prevents π from being much better than β due to the conservative update. On the grid world, we study this problem of value estimation with different values of τ. Figure 2 visualizes the propagation of Q value estimation errors and the learned policies. We assume a mediocre behavior policy tending to move left and down. For the rarely visited states in the upper right part of the grid, there are errors in the value estimation of Q̃^{π,τ}(s, a), i.e., |Q(s, a) − Q̃^{π,τ}(s, a)| > 0, where Q(s, a) is the Q value we learn during training and Q̃^{π,τ}(s, a) is the ground-truth soft Q value. Because the bad initial policy (Fig. 1a) tends to move towards the right part, without a strong KL regularization, the policy evaluation can be problematic due to the errors of value estimation in the right part of the grid world. In Fig. 2, with a small KL regularization weight τ = 0.001, the first row shows that errors even propagate to the states frequently visited by the behavior policy.
On the other hand, when we set a large value τ = 1 (second row), the error |Q(s, a) − Q̃^{π,τ}(s, a)| is smaller, yet the performance of the learned policy is not much better than the behavior policy. Our continuation method gradually moves the policy update between these two extremes. The value estimation benefits from the gradually relaxed KL regularization and the errors remain small. The last column of Fig. 2 visualizes the learned policy under each method. With constant τ = 0.001, the wrong value estimates in some states mislead the agent: it fails to visit any terminal state and gets stuck at the state in dark orange. With constant τ = 1, the tight KL-divergence constraint keeps the learned policy close to the behavior policy, mostly visiting the bottom-left part of the environment. With the continuation method, the agent learns to always take the optimal path moving left directly and obtains the highest expected return. More details of this example are provided in the appendix. In the toy example, gradually relaxing the KL regularization towards zero alleviates the propagation of errors in the soft Q estimate and helps the agent converge to the optimal policy. In more complicated domains, we find that as τ decays close to 0, the policy evaluation is still erroneous. To mitigate this issue, we introduce an ensemble of critic networks {Q_φ^(1), Q_φ^(2), ..., Q_φ^(K)} to approximate the soft Q value, and monitor the variance of the value estimates across the critic networks to measure uncertainty. Given a batch of data samples {s_i}_{i=1}^B ⊂ D,

var(Q^π) = (1/B) Σ_{i=1}^B E_{a∼π(·|s_i)} [ var(Q_φ^(1)(s_i, a), Q_φ^(2)(s_i, a), ..., Q_φ^(K)(s_i, a)) ]

indicates whether the current policy π tends to take actions with highly noisy value estimates. Our method is summarized in Algorithm 1. Instead of running soft policy evaluation and policy improvement until convergence, we alternate between optimizing the critic networks and the actor network with stochastic gradient descent.
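The uncertainty measure var(Q^π) above can be sketched in tabular form as follows (the expectation over a ∼ π is exact here; with continuous actions it would be a sample average):

```python
import numpy as np

def policy_q_variance(Q_ensemble, pi, states):
    """var(Q^pi): average over a batch of states of the expected
    (under pi) variance across the K critics.

    Q_ensemble: (K, S, A) tabular critics, pi: (S, A) policy,
    states: list of state indices forming the batch.
    """
    v = Q_ensemble[:, states, :]        # (K, B, A)
    per_action_var = v.var(axis=0)      # (B, A) variance across critics
    return np.mean(np.sum(pi[states] * per_action_var, axis=1))
```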
We set τ to a large value initially and let the KL divergence term dominate the objective, thus performing behavior cloning. We record a moving average of the Q value estimation variance var(Q^{π,τ_0}) over 1000 updates at the end of this phase. After that, we decay the temperature gradually with decay rate λ = 0.9 every I steps. When the moving average of the Q value estimation variance var(Q^{π,τ}) is large compared with the initial value var(Q^{π,τ_0}) (i.e., at the end of behavior cloning), we no longer trust the value estimate under the current temperature τ and take the policy checkpointed before the temperature decayed to this τ as our solution.
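The stopping rule can be sketched as follows. The window size and the factor `ratio` by which the variance may exceed the behavior-cloning baseline are illustrative choices of ours; the paper records a 1000-update moving average but does not pin down the threshold.

```python
from collections import deque
import numpy as np

class VarianceStopper:
    """Moving average of the critic-ensemble variance, with a flag that
    trips when it exceeds the behavior-cloning baseline by `ratio`."""

    def __init__(self, window=1000, ratio=2.0):
        self.buf = deque(maxlen=window)
        self.baseline = None
        self.ratio = ratio

    def update(self, var_q):
        self.buf.append(var_q)
        return np.mean(self.buf)

    def freeze_baseline(self):
        # Call once, at the end of the behavior-cloning phase.
        self.baseline = np.mean(self.buf)

    def should_stop(self):
        if self.baseline is None or not self.buf:
            return False
        return np.mean(self.buf) > self.ratio * self.baseline
```

When `should_stop()` fires, training rolls back to the policy checkpointed before the current temperature, matching the rule described above.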

4. EXPERIMENTS

4.1. MUJOCO

We evaluate our method against several baselines on continuous control tasks. We train a Proximal Policy Optimization agent [28] with entropy regularization for 1000 million steps in the environments. We parameterize the policy using Gaussian policies where the mean is a linear function of the agent's state, θᵀs, and the covariance is an identity matrix, keeping the policy simple as introduced in [3]. To generate training datasets D of varying quality, we construct the behavior policy by mixing a well-trained policy N(θ_optᵀs, 0.5I), i.e., the checkpoint with the highest score during training, and a poor policy N(θ_0ᵀs, 0.5I), i.e., the checkpoint at the beginning of training, with weight α. The behavior policy is then β(·|s) = N(((1−α)θ_opt + αθ_0)ᵀs, 0.5I). We generate trajectories and store a total of one million data samples from the mixed behavior policy for different values of the coefficient α. The architecture of the target policy is the same as that of the behavior policy. We consider six baseline approaches: BCQ [14], BEAR [21], ABM+SVG [29], CRR [33], CQL [22], and BRAC [34]. For a fair comparison, the architectures of the ensemble critic network and the policy network are the same in the baselines and our method, except for BCQ, which has no policy network. To evaluate and compare the methods, we run the learned policy in the environments for 100 episodes and report the average episode reward in Fig. 3. For the continuation method, we report the score of the last checkpointed policy with reasonable value estimation variance, as explained in Section 3.3. For the baselines, we report the score of the final policy when we terminate training at 1.5M updates. Tab. 1 shows that our method outperforms all the baselines on 5 settings. On datasets of relatively reasonable quality (i.e., α = 0.2, 0.4, 0.6), ours performs comparably to or better than the baselines.
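The dataset-generating behavior policy of this setup can be sketched as below; the state/action dimensions and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_behavior_policy(theta_opt, theta_0, alpha, cov_scale=0.5):
    """Mixed linear-Gaussian behavior policy from the MuJoCo setup:
    beta(.|s) = N(((1 - alpha) * theta_opt + alpha * theta_0)^T s, 0.5 I).

    theta_opt, theta_0: (d_s, d_a) parameter matrices; alpha in [0, 1]
    interpolates between the well-trained and the poor checkpoint.
    """
    theta = (1 - alpha) * theta_opt + alpha * theta_0
    def sample_action(s):
        mean = theta.T @ s
        return mean + np.sqrt(cov_scale) * rng.standard_normal(mean.shape)
    return sample_action, theta
```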
With α = 0.2, i.e., a close-to-optimal behavior policy, all the methods perform similarly, and one can achieve a good return by simply cloning the behavior policy. With α = 0.8, i.e., a low-quality behavior policy, there are few good trajectories in the dataset for any method to learn from. The advantage of our method is most obvious when α = 0.6 (Fig. 3), as the dataset contains trajectories with both high and low cumulative rewards. Our method can learn from the relatively large number of good trajectories while deviating from the behavior policy to avoid the bad trajectories and achieve higher rewards. In Fig. 3, 'Constant' denotes optimizing the KL-regularized expected return with a constant value of τ. We search over several values of τ and report the best result. Gradually relaxing the constraint performs better than a fixed constraint. In Fig. 3 (left), as τ decays close to 0, the learned policy can degrade due to errors in the Q estimation; the stopping condition explained in Section 3.3 is however able to identify a good policy before the degenerating point. More experimental details are in the Appendix.

4.2. ATARI

We further study our method on several Atari games from the Arcade Learning Environment (ALE) [4]. The rich observation space requires more complicated policies and makes policy optimization even more challenging. We focus on eight games and generate the datasets as discussed in Fujimoto et al. [13]. We use a mediocre DQN agent, trained online for 10 million timesteps (40 million frames). The performance of the DQN agent is shown as 'Online DQN' in Fig. 4. We add exploratory noise to the DQN agent (at 10 million timesteps) to gather a new set of 10 million transitions, similar to [13]. The line 'Behavior' in Fig. 4 shows the average trajectory reward in the dataset D. The dataset D is used to train each offline RL agent. We compare with BCQ [13], REM [2] and CQL [22], as they are recently proposed offline RL algorithms that work well on the Atari domain. For evaluation, we run 10 episodes on the Atari games with the learned policies and record the average episode reward (Fig. 4). Tab. 2 summarizes the performance of BCQ, REM and CQL after 6M updates. For our method, we report the score before the variance of the Q estimate becomes too high. Our approach achieves higher scores than the baselines on 7 out of 8 games, and performs comparably on the other one. Agarwal et al. [2] report that REM performs well on a dataset consisting of the entire replay experience collected during the online training of a DQN agent for 50M timesteps (200M frames). We hypothesize that learning on the entire replay experience makes the setup easier, as the training dataset contains more exploratory and higher-quality trajectories. With a dataset of much smaller size and worse quality, REM performs poorly in this single-behavior-policy setting. We use the same critic network architecture for both our method and BCQ, with an ensemble of 4 Q networks. As mentioned in [13], BCQ only matches the performance of the online DQN on most games.
In contrast, ours is able to outperform the online DQN significantly on several games. As presented in [22], CQL performs better than REM on the Atari datasets, while our method outperforms CQL on 7 out of 8 datasets.



¹ If the behavior policy is not known in advance, it can be fitted from the data [30, 7].



With a reasonable value of τ, we enjoy a linear convergence rate toward π*_τ from the randomly initialized policy π_θ. As τ decreases, π*_τ gets closer to π*. The final optimization of V^{π_θ}(ρ) starting from π*_τ can be much faster than starting from a randomly initialized π_θ.

Figure 1: (a) A grid world with sparse rewards. (b) Learning curves of the value of the learned policy π_{θ_i}. We conduct a hyper-parameter search over the learning rates {5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001} and report the best performance for each method.

Algorithm 1 Soft Policy Iteration through the Continuation Method
1: Initialize: actor network π_θ, ensemble critic networks {Q_φ^(1), Q_φ^(2), ..., Q_φ^(K)}, behavior policy network β, penalty coefficient τ, decay rate λ, number of iterations I for each τ
2: Input: training dataset D = {(s_i, a_i, r_i, s'_i)}_{i=1}^N
3: for update j = 0, 1, ... do
4:   Sample a batch of data {(s_i, a_i, r_i, s'_i)}_{i=1}^B from D
5:   # Learn the behavior policy
6:   Update β with the behavior cloning objective
7:   # Train the critic network
8:   Update φ^(k), k = 1, ..., K, to minimize the temporal difference error (1/B) Σ_{i=1}^B ( r_i + γ V(s'_i) − Q_φ^(k)(s_i, a_i) )²,
9:     where V(s) = (1/K) Σ_{k=1}^K E_{a∼π_θ(·|s)} [Q_φ^(k)(s, a)] − τ KL(π_θ(·|s) ‖ β(·|s))
10:  # Train the actor network
11:  Update θ to maximize (1/B) Σ_{i=1}^B [ (1/K) Σ_{k=1}^K E_{a∼π_θ(·|s_i)} [Q_φ^(k)(s_i, a)] − τ KL(π_θ(·|s_i) ‖ β(·|s_i)) ]
12:  # Decay the weight of KL regularization τ every I updates
13:  if j mod I = 0 then
14:    τ ← τ · λ
15:  end if
16: end for

Figure 2: Visualization of the error in soft Q value estimation and the quality of the learned policy. In the first four columns, triangles represent the error for actions moving in different directions; darker color indicates higher error. To show the performance of the learned policy π_{θ_1000}, the length of each arrow represents the probability of taking each action in each state. We run π_{θ_1000} in the grid world and visualize the visitation counts in the last column (heatmap); darker color means more visitation.

Figure 3: Learning curves of average reward over 5 runs on Mujoco tasks with ↵ = 0.6. The shaded area with light color indicates the standard deviation of the reward. The gray vertical lines on the yellow curves indicate where we take the checkpointed policy, according to the measure of Q value variance, as our final solution.

Table 1: Results on Mujoco. We show the average and standard deviation of the scores over 5 independent runs.

Table 2: Results on Atari. We show the mean and standard deviation of the scores achieved over 3 independent runs.

5. CONCLUSION

We propose a simple yet effective approach, a soft policy iteration algorithm through the continuation method, to alleviate two challenges in policy optimization under batch reinforcement learning: (1) a highly non-smooth objective function that is difficult to optimize; (2) high variance in value estimates. We provide theoretical grounding and visualization tools to help understand this technique, and demonstrate its efficacy on multiple complex tasks.


Table 3: Results on the recommendation task. We report the mean and standard deviation of the precision over 20 independent runs.

4.3. RECOMMENDER

We also showcase our proposed method for building a softmax recommender agent. We use the publicly available MovieLens-1M dataset, a popular benchmark for recommender systems. It contains 1 million ratings of 3,900 movies (with title and genre features) from 6,040 users (with demographic features). The problem of recommending movies to each user can be converted to a contextual bandit problem, where we aim to learn a target policy π_θ(a|s) selecting the proper action (movie) a for each state (user) s to receive a high reward (rating) r in a single step. The 5-point ratings are converted to binary rewards using a cutoff of 4. To evaluate whether a learned target policy works well, ideally we should run the learned policy in real recommendation environments. However, such environments for online tests are rarely publicly available. Thus, we use an online simulation method: we train a simulator to predict the immediate binary feedback from user and movie features, and the well-trained simulator serves as a proxy for the real online environment, because it outputs feedback for any user-movie pair. Similar to [25], we train the simulator with all records of logged feedback in the MovieLens-1M dataset. The behavior policy is trained with partial data in MovieLens-1M. We then construct bandit datasets of different sizes and quality by using different behavior policies to select movies a_i for users s_i and obtaining the binary feedback r_i from the well-trained simulator. We train offline RL agents on the generated dataset D and use the simulator to evaluate the learned policies on a held-out test set of users. We compare our method with two baselines commonly used in current industrial recommender systems [10, 7].
(1) Cross-Entropy: a supervised learning method for the softmax recommender, where the learning objective is the cross-entropy loss J_CE(θ) = −(1/N) Σ_{i=1}^N r_i log π_θ(a_i|s_i). (2) IPS: the off-policy policy gradient method introduced in [7], with the learning objective J_IPS(θ) = −(1/N) Σ_{i=1}^N r_i sg(π_θ(a_i|s_i)/β(a_i|s_i)) log π_θ(a_i|s_i), where sg indicates a stop-gradient operation. J_IPS(θ) produces the same gradient as that of the function −(1/N) Σ_{i=1}^N r_i π_θ(a_i|s_i)/β(a_i|s_i). Thus, minimizing the loss J_IPS(θ) amounts to maximizing the expected return with importance sampling. (3) Ours: in the bandit setting, we simply perform IPS with gradually decaying KL regularization, since estimating the soft Q value from Bellman updates is not needed. Tab. 3 clearly demonstrates the advantage of our proposed method over the baselines. IPS can be viewed as vanilla policy gradient with importance sampling to correct the distribution shift. Our method clearly outperforms it across datasets collected using different behavior policies.
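The stop-gradient equivalence for IPS can be verified numerically for a tabular softmax policy. The sketch below (our own, with made-up data) computes the analytic gradient of J_IPS and lets it be checked against finite differences of −(1/N) Σ_i r_i π_θ(a_i|s_i)/β(a_i|s_i):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ips_grad(theta, states, actions, rewards, beta_probs):
    """Gradient of J_IPS(theta) = -(1/N) sum_i r_i * sg(w_i) * log pi(a_i|s_i)
    for a tabular softmax policy pi_theta(.|s) = softmax(theta[s]),
    with importance weight w_i = pi_theta(a_i|s_i) / beta(a_i|s_i).
    """
    grad = np.zeros_like(theta)
    N = len(states)
    for s, a, r, b in zip(states, actions, rewards, beta_probs):
        pi = softmax(theta[s])
        w = pi[a] / b                          # treated as a constant (sg)
        one_hot = np.eye(theta.shape[1])[a]
        grad[s] -= r * w * (one_hot - pi) / N  # -r * w * grad log pi(a|s)
    return grad
```

Because sg freezes the weight, each term contributes −r·w·∇log π(a|s), which coincides term-by-term with the gradient of −r·π(a|s)/β(a|s).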

