AWAC: ACCELERATING ONLINE REINFORCEMENT LEARNING WITH OFFLINE DATASETS

Abstract

Reinforcement learning provides an appealing formalism for learning control policies from experience. However, the classic active formulation of reinforcement learning necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings. If we can instead allow reinforcement learning to effectively use previously collected data to aid the online learning process, where the data could be expert demonstrations or more generally any prior experience, we could make reinforcement learning a substantially more practical tool. While a number of recent methods have sought to learn offline from previously collected data, it remains exceptionally difficult to train a policy with offline data and improve it further with online reinforcement learning. In this paper we systematically analyze why this problem is so challenging, and propose an algorithm that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of reinforcement learning policies. We show that our method enables rapid learning of skills with a combination of prior demonstration data and online experience across a suite of difficult dexterous manipulation and benchmark tasks.

1. INTRODUCTION

Learning models that generalize effectively to complex open-world settings, from image recognition (Krizhevsky et al., 2012) to natural language processing (Devlin et al., 2019), relies on large, high-capacity models and large, diverse, and representative datasets. Leveraging this recipe for reinforcement learning (RL) has the potential to yield real-world generalization for control applications such as robotics. However, while deep RL algorithms enable the use of large models, the use of large datasets for real-world RL has proven challenging. Most RL algorithms collect new data online every time a new policy is learned, which limits the size and diversity of the datasets for RL. In the same way that powerful models in computer vision and NLP are often pre-trained on large, general-purpose datasets and then fine-tuned on task-specific data, RL policies that generalize effectively to open-world settings will need to be able to incorporate large amounts of prior data effectively into the learning process, while still collecting additional data online for the task at hand.

For data-driven reinforcement learning, offline datasets consist of trajectories of states, actions, and associated rewards. This data can potentially come from demonstrations for the desired task (Schaal, 1997; Atkeson & Schaal, 1997), suboptimal policies (Gao et al., 2018), demonstrations for related tasks (Zhou et al., 2019), or even just random exploration in the environment. Depending on the quality of the data that is provided, useful knowledge can be extracted about the dynamics of the world, about the task being solved, or both. Effective data-driven methods for deep reinforcement learning should be able to use this data to pre-train offline while improving with online fine-tuning. Since this prior data can come from a variety of sources, we would like to design an algorithm that does not utilize different types of data in any privileged way.
For example, prior methods that incorporate demonstrations into RL directly aim to mimic these demonstrations (Nair et al., 2018), which is desirable when the demonstrations are known to be optimal, but imposes strict requirements on the type of offline data and can cause undesirable bias when the prior data is not optimal. While prior methods for fully offline RL provide a mechanism for utilizing offline data (Fujimoto et al., 2019; Kumar et al., 2019), as we show in our experiments, such methods are generally not effective for fine-tuning with online data because they are often too conservative. In effect, prior methods force us to choose: do we assume the prior data is optimal or not? Do we use only offline data, or only online data? To make it feasible to learn policies for open-world settings, we need algorithms that learn successfully in any of these cases.

In this work, we study how to build RL algorithms that are effective for pre-training from off-policy datasets, but also well suited to continuous improvement with online data collection. We systematically analyze the challenges of using standard off-policy RL algorithms (Haarnoja et al., 2018; Kumar et al., 2019; Abdolmaleki et al., 2018) for this problem, and introduce a simple actor-critic algorithm that elegantly bridges data-driven pre-training from offline data and improvement with online data collection. Our method, which uses dynamic programming to train a critic but a supervised-learning-style update to train a constrained actor, combines the best of supervised learning and actor-critic algorithms. Dynamic programming can leverage off-policy data and enable sample-efficient learning. The simple supervised actor update implicitly enforces a constraint that mitigates the effects of distribution shift when learning from offline data (Fujimoto et al., 2019; Kumar et al., 2019), while avoiding overly conservative updates.
We evaluate our algorithm on a wide variety of robotic control and benchmark tasks across three simulated domains: dexterous manipulation, tabletop manipulation, and MuJoCo control tasks. Our algorithm, Advantage Weighted Actor Critic (AWAC), is able to quickly learn successful policies on difficult tasks with high-dimensional action spaces and sparse binary rewards, significantly outperforming prior methods for off-policy and offline reinforcement learning. Moreover, AWAC can utilize different types of prior data without any algorithmic changes: demonstrations, suboptimal data, or random exploration data. The contribution of this work is not just another RL algorithm, but a systematic study of what makes offline pre-training with online fine-tuning unique compared to the standard RL paradigm, which then directly motivates a simple algorithm, AWAC, to address these challenges.

2. PRELIMINARIES

We consider the standard reinforcement learning notation, with states s, actions a, policy π(a|s), rewards r(s, a), and dynamics p(s'|s, a). The discounted return is defined as R_t = Σ_{i=t}^{T} γ^i r(s_i, a_i), for a discount factor γ and horizon T, which may be infinite. The objective of an RL agent is to maximize the expected discounted return J(π) = E_{p_π(τ)}[R_0] under the trajectory distribution induced by the policy. The optimal policy can be learned directly by policy gradient, estimating ∇J(π) (Williams, 1992), but this is often ineffective due to the high variance of the estimator. Many algorithms attempt to reduce this variance by making use of the value function V^π(s) = E_{p_π(τ)}[R_t | s], the action-value function Q^π(s, a) = E_{p_π(τ)}[R_t | s, a], or the advantage A^π(s, a) = Q^π(s, a) − V^π(s).

Instead of estimating policy gradients directly, actor-critic algorithms maximize returns by alternating between two phases (Konda & Tsitsiklis, 2000): policy evaluation and policy improvement. During the policy evaluation phase, the critic Q^π(s, a) is estimated for the current policy π. This can be accomplished by repeatedly applying the Bellman operator B^π, corresponding to the right-hand side of Equation 1, as defined below:

B^π Q(s, a) = r(s, a) + γ E_{p(s'|s,a)}[E_{π(a'|s')}[Q(s', a')]].

By iterating Q^{k+1} = B^π Q^k, Q^k converges to Q^π (Sutton & Barto, 1998). With function approximation, we cannot apply the Bellman operator exactly, and instead minimize the Bellman error with respect to the Q-function parameters φ_k:

φ_k = arg min_φ E_D[(Q_φ(s, a) − y)^2],   y = r(s, a) + γ E_{s',a'}[Q_{φ_{k−1}}(s', a')].

During policy improvement, the actor π is typically updated based on the current estimate of Q^π.
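As a concrete illustration of the policy evaluation phase, the sketch below repeatedly applies the Bellman operator B^π in a small tabular MDP until Q reaches its fixed point Q^π. The MDP (transition probabilities, rewards, and policy) consists of made-up values chosen for illustration; it is not from the paper.

```python
import numpy as np

# Tabular policy evaluation: iterating Q^{k+1} = B^π Q^k drives any
# initial Q toward Q^π, the unique fixed point of the Bellman operator.
gamma = 0.9
# P[s, a, s'] : transition probabilities p(s'|s, a) (2 states, 2 actions)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])          # r(s, a)
pi = np.array([[0.6, 0.4],
               [0.2, 0.8]])         # π(a|s)

def bellman_operator(Q):
    """B^π Q(s,a) = r(s,a) + γ E_{s'~p}[ E_{a'~π}[ Q(s', a') ] ]."""
    V = (pi * Q).sum(axis=1)        # V(s') = Σ_a' π(a'|s') Q(s', a')
    return r + gamma * P @ V        # P[s,a,:] · V gives the s' expectation

Q = np.zeros_like(r)
for _ in range(200):                # Q^{k+1} = B^π Q^k
    Q = bellman_operator(Q)

# At convergence, Q satisfies the Bellman equation: Q = B^π Q.
assert np.allclose(Q, bellman_operator(Q), atol=1e-6)
```

Because B^π is a γ-contraction, the iteration converges geometrically; 200 iterations at γ = 0.9 leave a residual well below the tolerance. The subsequent fitted-Q update in the text replaces this exact operator application with a regression step against the target y.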
A commonly used technique (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018) is to update the actor π_θ(a|s) via likelihood-ratio or pathwise derivatives, such that the expected value of the Q-function is maximized:

θ_k = arg max_θ E_{s∼D}[E_{π_θ(a|s)}[Q_{φ_k}(s, a)]].

Actor-critic algorithms are widely used in deep RL (Mnih et al., 2016; Lillicrap et al., 2016; Haarnoja et al., 2018; Fujimoto et al., 2018). With a Q-function estimator, they can in principle utilize off-policy data when used with a replay buffer β that stores prior transition tuples for sampling, although we show that this by itself is insufficient for our problem setting.
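The policy improvement step above can be sketched in closed form for a softmax policy over discrete actions, where the gradient of E_{π_θ}[Q] with respect to the logits is available exactly. The Q table, problem sizes, and step size below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Policy improvement sketch: gradient ascent on E_{a~π_θ(·|s)}[Q(s, a)]
# for a softmax policy with per-state logits θ, holding the critic fixed.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 4
Q = rng.normal(size=(n_states, n_actions))   # frozen critic estimate Q_φ
theta = np.zeros((n_states, n_actions))      # policy logits

def pi(theta):
    """Numerically stable softmax over actions for each state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

alpha = 0.5
for _ in range(500):
    p = pi(theta)
    # Exact gradient of Σ_a π_θ(a|s) Q(s,a) w.r.t. the logit θ[s,a]:
    #   π(a|s) (Q(s,a) − E_{a'~π}[Q(s,a')])
    baseline = (p * Q).sum(axis=1, keepdims=True)
    theta += alpha * p * (Q - baseline)

# The improved policy concentrates on the highest-Q action in each state.
assert (pi(theta).argmax(axis=1) == Q.argmax(axis=1)).all()
```

With continuous actions and a neural-network policy, deep actor-critic methods replace this exact gradient with likelihood-ratio or pathwise (reparameterized) estimates of the same objective, sampling states from the replay buffer β.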



The action-value function for a policy can be written recursively via the Bellman equation:

Q^π(s, a) = r(s, a) + γ E_{p(s'|s,a)}[V^π(s')] = r(s, a) + γ E_{p(s'|s,a)}[E_{π(a'|s')}[Q^π(s', a')]].   (1)

