GUIDED EXPLORATION WITH PROXIMAL POLICY OPTIMIZATION USING A SINGLE DEMONSTRATION

Anonymous

Abstract

Solving sparse-reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially observable environments. Critically, the algorithm proposed in this article uses a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and its own experience to solve problems with variable initial conditions. We adapt this idea and integrate it with proximal policy optimization (PPO). The agent is able to increase its performance and to tackle harder problems by replaying its own past trajectories, prioritizing them based on the obtained reward and the maximum value of the trajectory. We compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task of comparable difficulty in a three-dimensional environment from only one human demonstration has not been considered before.

1. INTRODUCTION

Exploration is one of the most challenging problems in reinforcement learning. Although significant progress has been made in recent years (Badia et al., 2020a;b; Burda et al., 2018; Pathak et al., 2017; Ecoffet et al., 2019), many of the solutions rely on very strong assumptions, such as access to the agent position or deterministic, fully observable, or low-dimensional environments. For three-dimensional, partially observable, stochastic environments with no access to the agent position, the exploration problem remains unsolved. Learning from demonstrations bypasses this problem directly, but it only works under specific conditions, e.g. a large number of demonstration trajectories or access to an expert that can be queried for optimal actions (Ross et al., 2011). Furthermore, a policy learned from demonstrations in this way will only be optimal in the vicinity of the demonstrator trajectories, and only for the initial conditions of those trajectories. Our work is at the intersection of these two classes of solutions, exploration and imitation, in that we use only one trajectory from the demonstrator per problem to solve hard-exploration tasks¹. This approach has been explored before by Paine et al. (2019) (for a thorough comparison see Section 5.2). We propose the first implementation based on on-policy algorithms, in particular PPO. Henceforth we refer to the version of the algorithm we put forward as PPO + Demonstrations (PPO+D). Our contributions can be summarized as follows:

1. We treat the demonstration trajectories as if they were actions taken by the agent in the real environment. We can do this because in PPO the policy only specifies a distribution over the action space. We force the actions of the policy to equal the demonstration actions instead of sampling from the policy distribution, and in this way we accelerate the exploration phase. We use importance sampling to account for sampling from a distribution different from the policy. The frequency with which the demonstrations are sampled depends on an adjustable hyperparameter ρ, as described in Paine et al. (2019).
2. Our algorithm includes the successful trajectories in the replay buffer during training and treats them as demonstrations.
3. The non-successful trajectories are ranked according to their maximum estimated value and stored in a separate replay buffer.
4. We mitigate the effect of catastrophic forgetting by using the maximum reward and the maximum estimated value of the trajectories to prioritize experience replay.

PPO+D is only partly on-policy, since a fraction of its experience comes from a replay buffer and was therefore collected by an older version of the policy. The importance sampling is limited to the action loss in PPO and does not adjust the target value in the value loss, as in Espeholt et al. (2018). We found that this new algorithm is capable of solving problems that cannot be solved with plain PPO, behavioral cloning, GAIL, or a combination of behavioral cloning and PPO. PPO+D is very easy to implement, requiring only a slight modification of the PPO algorithm. Crucially, the learned policy is significantly different from, and more efficient than, the demonstration used in training. To test this new algorithm we created a benchmark of hard-exploration problems of varying difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). All the tasks considered have random initial positions, and the PPO policy uses entropy regularization, so memorizing the sequence of actions in the demonstration will not suffice to complete any of the tasks.
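The replay and importance-sampling mechanics above can be sketched as follows. This is a minimal illustration, not the authors' code: the class and function names are hypothetical, and the buffers and clipped loss are reduced to their essentials. With probability ρ a stored trajectory is replayed with forced actions, so the behaviour distribution is a point mass (probability 1) and the importance ratio reduces to the current policy's probability of the demonstrated action.

```python
import random

class PPODemoSampler:
    """Sketch of the PPO+D replay logic (hypothetical names)."""

    def __init__(self, rho, demos):
        self.rho = rho                     # replay frequency hyperparameter
        self.success_buffer = list(demos)  # demos + past successful episodes
        self.value_buffer = []             # failed episodes, ranked by max value

    def sample_trajectory(self, rollout_fn):
        """Return (trajectory, per-step behaviour probabilities)."""
        if self.success_buffer and random.random() < self.rho:
            traj = random.choice(self.success_buffer)
            return traj, [1.0] * len(traj)  # forced actions: behaviour prob = 1
        return rollout_fn()                 # ordinary on-policy rollout

    def store(self, traj, episode_return, max_value, capacity=10):
        """Successful episodes join the demonstration buffer; failed ones
        go to a separate buffer ranked by maximum estimated value."""
        if episode_return > 0:
            self.success_buffer.append(traj)
        else:
            self.value_buffer.append((max_value, traj))
            self.value_buffer.sort(key=lambda item: -item[0])
            del self.value_buffer[capacity:]

def ppo_action_loss(pi_prob, behaviour_prob, advantage, clip_eps=0.2):
    """Clipped PPO action loss with an importance-sampling ratio.  Only
    the action loss is corrected; the value target is left unadjusted
    (unlike V-trace, Espeholt et al., 2018)."""
    ratio = pi_prob / behaviour_prob
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)
```

For demonstration steps `behaviour_prob` is 1, so the ratio is simply the probability the current policy assigns to the demonstrated action.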

2. RELATED WORK

Different attempts have been made to use demonstrations efficiently in hard-exploration problems. In Salimans & Chen (2018) the agent learns a robust policy using only one demonstration. The demonstration is replayed for n steps, after which the agent is left to learn on its own. By incrementally decreasing the number of steps n, the agent learns a policy that is robust to the randomness introduced by sticky actions or no-ops (Machado et al., 2018), since the game is fundamentally deterministic. However, this approach only works in a fully deterministic environment, since replaying the actions serves to reset the environment to a particular configuration. This method of resetting is not feasible in a stochastic environment.

Ecoffet et al. (2019) propose another algorithm that largely exploits the determinism of the environment by resetting to previously reached states. It works by maximizing the diversity of the states reached, identifying distinct states by down-sampling the observations and treating observations with different down-sampled images as different states. This works remarkably well in two-dimensional environments, but is unlikely to work in three-dimensional, stochastic environments. Self-supervised prediction approaches, such as Pathak et al. (2017); Burda et al. (2018); Schmidhuber (2010); Badia et al. (2020b), have been used successfully in stochastic environments, although they are less effective in three-dimensional environments.

Another class of algorithms designed to solve exploration problems are count-based methods (Tang et al., 2017; Bellemare et al., 2016; Ostrovski et al., 2017). These algorithms keep track of the states the agent has visited before (if the number of states is prohibitive, the dimensions along which the agent moves can be discretized) and give the agent an incentive, in the form of a bonus reward, for visiting new states. This approach assumes a reliable way to track the position of the agent. An empirical comparison between these two classes of exploration algorithms was made in Baker et al. (2019), where agents compete against each other by leveraging tools that they learn to manipulate. They found that the count-based approach works better when applied not only to the agent position but also to relevant entities in the environment (such as the positions of objects). When only given access to the agent position, the RND algorithm (Burda et al., 2018) was found to lead to higher performance.

Some other works focus on leveraging expert demonstrations while still maximizing the reward, which allows the agent to outperform the demonstrator in many problems. Hester et al. (2018) combine temporal-difference updates in the Deep Q-Network (DQN) algorithm with supervised classification of the demonstrator's actions. Kang et al. (2018) propose to leverage available demonstrations to guide exploration by enforcing occupancy-measure matching between the learned policy and the demonstrations.

¹ A video showing the experimental results is available at https://www.youtube.com/playlist?list=PLBeSdcnDP2WFQWLBrLGSkwtitneOelcm-
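As a toy illustration of the count-based bonus discussed in this section: the 1/√N(s) decay schedule, the `beta` coefficient, and the grid-cell state discretization below follow common practice in the count-based literature and are assumptions, not a method from this paper.

```python
import math
from collections import defaultdict

def count_bonus(counts, state, beta=0.1):
    """Increment the visit count of a (discretized) state and return an
    exploration bonus that decays as beta / sqrt(N(s))."""
    counts[state] += 1
    return beta / math.sqrt(counts[state])

# Example: a coarse grid cell over the agent position serves as the state.
visit_counts = defaultdict(int)
first = count_bonus(visit_counts, (0, 0))   # first visit -> largest bonus
second = count_bonus(visit_counts, (0, 0))  # bonus shrinks on revisits
```

The bonus is added to the environment reward, giving the agent an incentive to reach rarely visited states.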

