Q-VALUE WEIGHTED REGRESSION: REINFORCEMENT LEARNING WITH LIMITED DATA

Abstract

Sample efficiency and performance in the offline setting have emerged as among the main challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), a simple RL algorithm that excels in these aspects. QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, also in the offline setting, but struggles on tasks with discrete actions and in sample efficiency. We perform a theoretical analysis of AWR that explains its shortcomings and use the insights to motivate QWR theoretically. We show experimentally that QWR matches state-of-the-art algorithms both on tasks with continuous and discrete actions. We study the main hyperparameters of QWR and find that it is stable across a wide range of their choices and on different tasks. In particular, QWR yields results on par with SAC on the MuJoCo suite and, with the same set of hyperparameters, yields results on par with a highly tuned Rainbow implementation on a set of Atari games. We also verify that QWR performs well in the offline RL setting, making it a compelling choice for reinforcement learning in domains with limited data.

1. INTRODUCTION

Deep reinforcement learning has been applied to a large number of challenging tasks, from games (Silver et al., 2017; OpenAI, 2018; Vinyals et al., 2017) to robotic control (Sadeghi & Levine, 2016; OpenAI et al., 2018; Rusu et al., 2016). Since RL makes minimal assumptions on the underlying task, it holds the promise of automating a wide range of applications. However, its widespread adoption has been hampered by a number of challenges. Reinforcement learning algorithms can be substantially more complex to implement and tune than standard supervised learning methods, can have a fair number of hyperparameters and be brittle with respect to their choices, and may require a large number of interactions with the environment. These issues are well-known, and there has been significant progress in addressing them. The policy gradient algorithm REINFORCE (Williams, 1992) is simple to understand and implement, but is brittle and requires on-policy data. Proximal Policy Optimization (PPO, Schulman et al. (2017)) is a more stable on-policy algorithm that has seen a number of successful applications, despite requiring a large number of interactions with the environment. Soft Actor-Critic (SAC, Haarnoja et al. (2018)) is a much more sample-efficient off-policy algorithm, but it is defined only for continuous action spaces and does not work well in the offline setting, known as batch reinforcement learning, where all samples are provided from earlier interactions with the environment and the agent cannot collect more samples. Advantage Weighted Regression (AWR, Peng et al. (2019)) is a recent off-policy actor-critic algorithm that works well in the offline setting and is built using only simple and convergent maximum likelihood loss functions, making it easier to tune and debug. It is competitive with SAC given enough time to train, but is less sample-efficient and has not been demonstrated to succeed in settings with discrete actions.
We replace the value function critic of AWR with a Q-value function. Next, we add action sampling to the actor training loop. Finally, we introduce a custom backup to the Q-value training. The resulting algorithm, which we call Q-Value Weighted Regression (QWR), inherits the advantages of AWR but is more sample-efficient and works well with discrete actions and in visual domains, e.g., on Atari games. To better understand QWR, we perform a number of ablations, checking different numbers of samples in actor training, different advantage estimators, and different aggregation functions. These choices affect the performance of QWR only to a limited extent, and it remains stable under each of the choices across the tasks we experiment with. We run experiments with QWR on the MuJoCo environments and on a subset of the Atari Learning Environment. Since sample efficiency is our main concern, we focus on the difficult case in which the number of interactions with the environment is limited; in most of our experiments we limit it to 100K interactions. The experiments demonstrate that QWR is indeed more sample-efficient than AWR. On MuJoCo, it performs on par with Soft Actor-Critic (SAC), the current state-of-the-art algorithm for continuous domains. On Atari, QWR performs on par with OTRainbow, a variant of Rainbow highly tuned for sample efficiency. Notably, we use the same set of hyperparameters (except for the network architecture) for both our final MuJoCo and Atari experiments.

2. ADVANTAGE WEIGHTED REGRESSION

Peng et al. (2019) recently proposed Advantage Weighted Regression (AWR), an off-policy, actor-critic algorithm notable for its simplicity and stability, achieving competitive results across a range of continuous control tasks. It can be expressed as interleaving data collection and two regression tasks performed on the replay buffer, as shown in Algorithm 1.

Algorithm 1 Advantage Weighted Regression.
1: θ ← random actor parameters
2: φ ← random critic parameters
3: D ← ∅
4: for k in 0..n_iterations - 1 do
5:   add trajectories {τ_i} sampled by π_θ to D
6:   for i in 0..n_critic_steps - 1 do
7:     sample (s, a) ∼ D
8:     φ ← φ - α_V ∇_φ ||R^{s,a}_D - V_φ(s)||²
9:   end for
10:  for i in 0..n_actor_steps - 1 do
11:    sample (s, a) ∼ D
12:    θ ← θ + α_π ∇_θ [log π_θ(a|s) exp((1/β)(R^{s,a}_D - V_φ(s)))]
13:  end for
14: end for

AWR optimizes the expected improvement of an actor policy π(a|s) over a sampling policy µ(a|s) by regression towards the well-performing actions in the collected experience. Improvement is achieved by weighting the actor loss by the exponentiated advantage A^µ(s, a) of an action, skewing the regression towards the better-performing actions. The advantage is calculated based on the expected return R^{s,a}_µ achieved by performing action a in state s and then following the sampling policy µ. To calculate the advantage, one first estimates the value V^µ(s) using a learned critic and then computes A^µ(s, a) = R^{s,a}_µ - V^µ(s). This results in the following objective for the actor:

arg max_π E_{s∼d_µ(s)} E_{a∼µ(·|s)} [log π(a|s) exp((1/β)(R^{s,a}_µ - V^µ(s)))],

where d_µ(s) = Σ_{t=1}^∞ γ^{t-1} p(s_t = s|µ) denotes the unnormalized, discounted state visitation distribution of the policy µ, and β is a temperature hyperparameter. The critic is trained to estimate the future returns of the sampling policy µ:

arg min_V E_{s∼d_µ(s)} E_{a∼µ(·|s)} ||R^{s,a}_µ - V(s)||².

To achieve off-policy learning, the actor and the critic are trained on data collected from a mixture of policies from different training iterations, stored in the replay buffer D.
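To make the two regression steps of Algorithm 1 concrete, the following is a minimal NumPy sketch of one outer AWR iteration on a tiny replay buffer. It is not the paper's implementation: the linear critic, the Bernoulli actor, the full-batch gradient steps (instead of sampled minibatches), and all function names here are illustrative assumptions.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return R_t = sum_k gamma^k r_{t+k} for each step of a trajectory."""
    returns = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns

def awr_iteration(states, actions, returns, phi, theta,
                  alpha_v=0.1, alpha_pi=0.1, beta=1.0,
                  n_critic_steps=100, n_actor_steps=100):
    """One outer iteration of AWR on a buffer of (state, action, return) tuples.

    Critic: linear value function V_phi(s) = phi . s.
    Actor:  Bernoulli policy with P(a=1|s) = sigmoid(theta . s).
    """
    # Critic regression: minimize ||R - V_phi(s)||^2 by gradient descent
    # (full-batch steps here, for determinism; the paper samples from D).
    for _ in range(n_critic_steps):
        v = states @ phi                              # V_phi(s) for every buffer entry
        phi = phi - alpha_v * (2 * (v - returns)) @ states / len(states)
    # Actor regression: log-likelihood weighted by exp(advantage / beta).
    for _ in range(n_actor_steps):
        adv = returns - states @ phi                  # A(s, a) = R - V_phi(s)
        w = np.exp(np.clip(adv / beta, -10.0, 10.0))  # clipped for numerical stability
        p = 1.0 / (1.0 + np.exp(-(states @ theta)))   # P(a=1|s) under the Bernoulli actor
        # Gradient of the weighted log-likelihood of a Bernoulli policy: w * (a - p) * s.
        theta = theta + alpha_pi * (w * (actions - p)) @ states / len(states)
    return phi, theta
```

On a one-state buffer where action 1 always returns 1 and action 0 always returns 0, the critic settles near the mean return 0.5 and the actor weight exp(adv/β) pulls the policy towards action 1, illustrating how the exponentiated advantage skews the regression towards better-performing actions.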

