GUIDED EXPLORATION WITH PROXIMAL POLICY OPTIMIZATION USING A SINGLE DEMONSTRATION

Anonymous

Abstract

Solving sparse reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially observable environments. Critically, the algorithm proposed in this article uses a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and its own experience to solve problems with variable initial conditions. We adapt this idea and integrate it with proximal policy optimization (PPO). The agent is able to increase its performance and to tackle harder problems by replaying its own past trajectories, prioritizing them based on the obtained reward and the maximum value of the trajectory. We compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task of comparable difficulty in a three-dimensional environment using only one human demonstration has never been considered before.

1. INTRODUCTION

Exploration is one of the most challenging problems in reinforcement learning. Although significant progress has been made on this problem in recent years (Badia et al., 2020a;b; Burda et al., 2018; Pathak et al., 2017; Ecoffet et al., 2019), many of the solutions rely on very strong assumptions, such as access to the agent position or deterministic, fully observable or low-dimensional environments. For three-dimensional, partially observable, stochastic environments with no access to the agent position, the exploration problem still remains unsolved. Learning from demonstrations allows one to bypass this problem directly, but it only works under specific conditions, e.g. a large number of demonstration trajectories or access to an expert to query for optimal actions (Ross et al., 2011). Furthermore, a policy learned from demonstrations in this way will only be optimal in the vicinity of the demonstrator trajectories and only for the initial conditions of such trajectories. Our work is at the intersection of these two classes of solutions, exploration and imitation, in that we use only one trajectory from the demonstrator per problem to solve hard-exploration tasks. This approach has been explored before by Paine et al. (2019) (for a thorough comparison see Section 5.2). We propose the first implementation based on on-policy algorithms, in particular PPO. Henceforth we refer to the version of the algorithm we put forward as PPO + Demonstrations (PPO+D). Our contributions can be summarized as follows:

1. In our approach, we treat the demonstration trajectories as if they were actions taken by the agent in the real environment. We can do this because in PPO the policy only specifies a distribution over the action space. We force the actions of the policy to equal the demonstration actions instead of sampling from the policy distribution, and in this way we accelerate the exploration phase.
We use importance sampling to account for sampling from a distribution different from the policy. The frequency with which the demonstrations are sampled depends on an adjustable hyperparameter ρ, as described in Paine et al. (2019).

2. Our algorithm includes the successful trajectories in the replay buffer during training and treats them as demonstrations.

3. The non-successful trajectories are ranked according to their maximum estimated value and are stored in a different replay buffer.

4. We mitigate the effect of catastrophic forgetting by using the maximum reward and the maximum estimated value of the trajectories to prioritize experience replay.

PPO+D is only in part on-policy, as a fraction of its experience comes from a replay buffer and was therefore collected by an older version of the policy. The importance sampling is limited to the action loss in PPO and does not adjust the target value in the value loss as in Espeholt et al. (2018). We found that this new algorithm is capable of solving problems that are not solvable using vanilla PPO, behavioral cloning, GAIL, or a combination of behavioral cloning and PPO. PPO+D is very easy to implement, requiring only a slight modification of the PPO algorithm. Crucially, the learned policy is significantly different from, and more efficient than, the demonstration used in the training. To test this new algorithm we created a benchmark of hard-exploration problems of varying levels of difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). All the tasks considered have random initial positions, and the PPO policy uses entropy regularization, so that memorizing the sequence of actions of the demonstration will not suffice to complete any of the tasks.

2. RELATED WORK

Different attempts have been made to use demonstrations efficiently in hard-exploration problems. In Salimans & Chen (2018) the agent is able to learn a robust policy using only one demonstration. The demonstration is replayed for n steps, after which the agent is left to learn on its own. By incrementally decreasing the number of steps n, the agent learns a policy that is robust to randomness (introduced in this case by using sticky actions or no-ops (Machado et al., 2018), since the game is fundamentally deterministic). However, this approach only works in a fully deterministic environment, since replaying the actions has the role of resetting the environment to a particular configuration. This method of resetting is obviously not feasible in a stochastic environment. Another class of solutions is count-based exploration (Tang et al., 2017; Bellemare et al., 2016; Ostrovski et al., 2017). These algorithms keep track of the states the agent has visited before (if we are dealing with a prohibitive number of states, the dimensions along which the agent moves can be discretized), and give the agent an incentive (in the form of a bonus reward) for visiting new states. This approach assumes we have a reliable way to track the position of the agent. An empirical comparison between these two classes of exploration algorithms was made in Baker et al. (2019), where agents compete against each other leveraging the use of tools that they learn to manipulate. They found the count-based approach works better when applied not only to the agent position, but also to relevant entities in the environment (such as the positions of objects). When only given access to the agent position, the RND algorithm (Burda et al., 2018) was found to lead to higher performance. Some other works focus on leveraging expert demonstrations while still maximizing the reward. These methods allow the agent to outperform the expert demonstration in many problems. Hester et al.
(2018) combine temporal difference updates in the Deep Q-Network (DQN) algorithm with supervised classification of the demonstrator's actions. Kang et al. (2018) propose to effectively leverage available demonstrations to guide exploration by enforcing occupancy measure matching between the learned policy and the demonstrations. Other approaches, such as Duan et al. (2017); Zhou et al. (2019), pursue a meta-learning strategy where the agent learns to learn from demonstrations. Such approaches are perhaps the most promising, but they require at least one demonstration for each task in a subset of all tasks. Generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) has never been successfully applied to complex partially observable environments that require memory (Paine et al., 2019). InfoGAIL (Li et al., 2017) has been used to learn from images but, unlike in this work, the policy does not require recurrency to complete the tasks. Behavioral Cloning (BC) is the most basic imitation learning technique; it is equivalent to supervised classification of the demonstrator's actions (Rahmatizadeh et al., 2018). Both GAIL and BC have been used in the Obstacle Tower Challenge (Juliani et al., 2019), but they alone are insufficient for solving hard-exploration problems (Nichol, 2019). Perhaps the article most similar to this work is Oh et al. (2018). The authors present an off-policy algorithm that learns to reproduce the agent's past good decisions. Their work focuses mainly on advantage actor-critic, but the idea was also tested with PPO. In contrast, PPO+D utilizes importance sampling, leverages expert (human) demonstrations, uses prioritized experience replay based on the maximum value of each trajectory to overcome catastrophic forgetting, and allows the ratio of demonstrations replayed during training to be explicitly controlled.

3. DEMONSTRATION GUIDED EXPLORATION WITH PPO+D

Our approach attempts to exploit the idea of combining demonstrations with the agent's own experience in an on-policy algorithm such as PPO. This approach is particularly effective for hard-exploration problems. One can view the replay of demonstrations as a possible trajectory of the agent in the current environment. This means that the only point at which we interfere with classical PPO is during sampling, which is substituted by simply replaying the demonstrations. A crucial difference between PPO+D and R2D3 (Paine et al., 2019) is that we do consider sequences that contain entire episodes, and therefore using recurrency is much more straightforward. From the perspective of the agent, it is as if it were always lucky when sampling the actions, and in doing so it skips the exploration phase.

The importance sampling formula provides the average reward over the policy π_θ given trajectories generated from a different policy π_θ'(a|s):

E_{π_θ}[r_t] = E_{π_θ'}[(π_θ(a_t|s_t) / π_θ'(a_t|s_t)) r_t],

where r_t = R(r|a_t, s_t) is the environment reward given state s_t and action a_t at time t, and E_{π_θ} indicates that the average is over trajectories drawn from the parameterized policy π_θ. The importance sampling term π_θ(a_t|s_t)/π_θ'(a_t|s_t) accounts for the correction due to the change of the distribution over actions from the policy π_θ'(a_t|s_t). By maximizing E_{π_θ'}[(π_θ(a_t|s_t)/π_θ'(a_t|s_t)) r_t] over the parameters θ, a new policy is obtained that is on average better than the old one. The PPO loss function is then defined as

L^{CLIP+VF+S}_t(θ) = E_t[L^{CLIP}_t(θ) + c_1 L^{VF}_t(θ) + c_2 S^{π_θ}(s_t, a_t)],

where

L^{CLIP}_t(s, a, θ', θ) = min((π_θ(a_t|s_t)/π_θ'(a_t|s_t)) A^{π_θ'}(s_t, a_t), g(ε, A^{π_θ'}(s_t, a_t))),

c_1 and c_2 are coefficients, L^{VF}_t is the squared-error loss (V_θ(s_t) - V^{targ}_t)^2, A^{π_θ'} is an estimate of the advantage function, and S is the entropy of the policy distribution.
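The clipped, importance-weighted surrogate above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation; the function and argument names are our own, and it works on pre-computed log-probabilities and advantages rather than on network outputs.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_behavior, advantages, eps=0.2):
    """Clipped surrogate objective. The importance ratio is taken
    against the behavior policy that generated the actions, which in
    PPO+D may be a replay policy rather than the previous pi_theta."""
    ratio = np.exp(logp_new - logp_behavior)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the objective, so the loss is its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and behavior log-probabilities coincide, the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio leaves the [1 - ε, 1 + ε] band, the clipped term caps the update.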
The entropy term is designed to help keep the search alive by preventing convergence to a single choice of output, especially when several choices all lead to roughly the same reward (Williams & Peng, 1991). Let D be the set of trajectories τ = (s_0, a_0, s_1, a_1, ...) for which we have a demonstration; then π_D(a_t|s_t) = 1 for (a_t, s_t) in D and 0 otherwise. This is a policy that only maps observations coming from the demonstration buffer to a distribution over actions. Such a distribution assigns probability one to the action taken in the demonstration and zero to all the remaining actions. The algorithm decides where to sample trajectories from each time an episode is completed (either by running out of time or by completing the task). We define D to be the set of all trajectories that can be replayed at any given time, D = D_V ∪ D_R, where D_V contains the trajectories collected by prioritizing the value estimate (V stands for value), and D_R (R stands for reward) contains the initial human demonstration and the successful trajectories the agent collects. The agent samples from the trajectories in D_R with probability ρ, from D_V with probability φ, and from the real environment Env with probability 1 - ρ - φ, subject to ρ + φ ≤ 1. The behavior policy can be defined as follows:

π^{φ,ρ}_θ = π_{D_R} if sampled from D_R; π_{D_V} if sampled from D_V; π_θ if sampled from Env.

In PPO+D we substitute the current policy π_θ with the policy π^{φ,ρ}_θ, since this is the policy used to collect the demonstration trajectories. The clipping term is then changed to correct for it:

L^{CLIP-PPO+D}_t(s, a, θ', θ) = min((π_θ(a_t|s_t)/π^{φ,ρ}_{θ'}(a_t|s_t)) A^{π^{φ,ρ}_{θ'}}(s_t, a_t), g(ε, A^{π^{φ,ρ}_{θ'}}(s_t, a_t))).  (4)

The full procedure is summarized in Algorithm 1: for each of the N actors, a trajectory source is sampled from {D_V, D_R, Env}; with probability ρ a batch of demonstrations is replayed from D_R, with probability φ a batch is replayed from D_V, and with probability 1 - ρ - φ a set of trajectories is collected by running the policy π_θ in the environment for steps 1, 2, ..., T.
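The three-way sampling rule behind π^{φ,ρ}_θ can be sketched as a small helper. This is a minimal illustration with hypothetical names; the actual learner replays whole trajectories from the chosen buffer rather than merely returning a label.

```python
import random

def choose_source(rho, phi, rng=random):
    """Pick where the next episode's actions come from, as in PPO+D:
    the reward buffer D_R with probability rho, the value buffer D_V
    with probability phi, and the real environment otherwise
    (subject to rho + phi <= 1)."""
    assert rho + phi <= 1.0
    u = rng.random()  # uniform draw in [0, 1)
    if u < rho:
        return "D_R"
    elif u < rho + phi:
        return "D_V"
    return "Env"
```

With ρ = 0.1 and φ = 0.3, for example, roughly 60% of episodes are collected from the real environment and the rest are replayed from the two buffers.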


At each step, an action is executed in the environment, s_t, a_t, r_t, s_{t+1} ∼ π_θ(a_t|s_t), and the transition is stored, E ← E ∪ {(s_t, a_t, r_t)}. Each time a new successful trajectory is added to D_R, the sampling probabilities and buffer sizes are annealed as

ρ ← ρ + φ_0/|D_V|_0;  φ ← φ - φ_0/|D_V|_0;  |D_V| ← |D_V| - 1;  |D_R| ← |D_R| + 1,

where |D_R|_0 and |D_V|_0 are the maximum sizes of D_R and D_V respectively, and φ_0 is the value of φ at the beginning of the training. We define the probability of sampling trajectory τ_i as P(i) = p_i^α / Σ_k p_k^α, where p_i = max_t V_θ(s_t) and α is a hyperparameter. We shift the value of p_i for all the trajectories in D_V so as to guarantee p_i ≥ 0. We only keep up to |D_V| unsuccessful trajectories at any given time. Values are only updated for the replayed transitions. Successful trajectories are similarly sampled from D_R with a uniform distribution (a better strategy could be to sample according to their length), and the buffer is updated following a FIFO strategy. We introduced the value-based experience replay because some tasks reach a level of complexity that we could not solve using self-imitation based solely on the reward. These are the tasks that the agent has trouble solving even once, because the required sequence of actions is too long or complicated. We prefer the value estimate as a selection criterion over the more commonly used TD-error, as we speculate it is more effective at retaining promising trajectories over long periods of time. Crucially, for the unsuccessful trajectories the maximum value estimate can be non-zero when the observations are similar to the ones seen in the demonstration. We think that this plays a role in countering the effects of catastrophic forgetting, thus allowing the agent to combine separately learned sub-behaviors into a successful policy. We illustrate this with some example trajectories of the agent shown in Figure 7 in the Appendix.
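The value-based prioritization P(i) = p_i^α / Σ_k p_k^α, including the shift that guarantees p_i ≥ 0, might be sketched as below. Names are our own, and the small constant guarding against an all-zero distribution is an added assumption, not something stated in the text.

```python
import numpy as np

def replay_probabilities(max_values, alpha=10.0):
    """Compute P(i) = p_i^alpha / sum_k p_k^alpha with
    p_i = max_t V(s_t), shifted so that every p_i >= 0.
    alpha = 10 follows the value reported in the appendix."""
    p = np.asarray(max_values, dtype=float)
    p = p - p.min()      # shift so all priorities are non-negative
    p = p + 1e-8         # avoid an all-zero (degenerate) distribution
    w = p ** alpha
    return w / w.sum()
```

With a large α such as 10, the distribution concentrates sharply on the trajectories with the highest maximum value estimate, which is consistent with the aggressive prioritization the appendix describes.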
The value estimate is noisy, and as a consequence trajectories that have a low maximum value estimate on first visit may not be replayed for a long time, or ever, as pointed out in Schaul et al. (2015). However, for our strategy to work it is enough for some of the promising trajectories to be collected and replayed by this mechanism.

4.1. THE ANIMAL-AI OLYMPICS CHALLENGE ENVIRONMENT

The recent successes of deep reinforcement learning (DRL) (Mnih et al., 2015; Silver et al., 2017; Schulman et al., 2017; Schrittwieser et al., 2019; Srinivas et al., 2020) have shown the potential of this field, but at the same time have revealed the inadequacy of using games (such as the ATARI games (Bellemare et al., 2013)) as a benchmark for intelligence. These inadequacies have motivated the design of more complex environments that provide a better measure of intelligence. The Obstacle Tower Environment (Juliani et al., 2019), the Animal-AI Olympics (Crosby et al., 2019), the Hard-Eight Task Suite (Paine et al., 2019) and the DeepMind Lab (Beattie et al., 2016) all exhibit sparse rewards, partial observability and highly variable initial conditions. For all the tests we use the Animal-AI Olympics challenge environment. The aim of the Animal-AI Olympics is to translate animal cognition into a benchmark of cognitive AI (Crosby et al., 2019). The environment contains an agent enclosed in a fixed-size arena. Objects can spawn in this arena, including positive and negative rewards (green, yellow and red spheres) that the agent must obtain or avoid. This environment has basic physics rules and a set of objects such as food, walls, negative-reward zones, movable blocks and more. The playground can be configured by the participants, who can spawn any combination of objects in preset or random positions. We take advantage of the great flexibility allowed by this environment to design hard-exploration problems for our experiments.

Figure 2: Tasks. In each of the tasks there is only one source of reward and the position of some of the objects is random, so each episode is different. The agent has no access to the aerial view; instead it partially observes the world through a first-person view of the environment. All of the tasks are either inspired by or adapted from the test set in the Animal-AI Olympics competition.

4.2. EXPERIMENTAL SETTING

We designed our experiments in order to answer the following questions: Can the agent learn a policy in a non-deterministic hard-exploration environment only with one human demonstration? Is the agent able to use the human demonstration to learn to solve a problem with different initial conditions (agent and boxes initial positions) than the demonstration trajectory? How does the hyperparameter φ influence the performance during training? The four tasks we used to evaluate the agent are described as follows:

• One box easy

The agent has to move a single box, always spawned at the same position, in order to bridge a gap and access the reward once it climbs up the ramp (visible in pink). The agent can be spawned anywhere in the range (X: 0.5-39.5, Y: 0.5-39.5), provided an object is not already present at the same location (Fig. 2a).

• One box hard

The agent has to move a single box in order to bridge a gap and access the reward; this time two boxes are spawned at any of four positions, A: (X: 10, Y: 10), B: (X: 10, Y: 30), C: (X: 30, Y: 10), D: (X: 30, Y: 30). The agent can be spawned anywhere in the range (X: 0.5-39.5, Y: 0.5-39.5), provided an object is not already present at the same location (Fig. 2b).

• Two boxes easy

The agent has to move two boxes in order to bridge a larger gap and access the reward; again the two boxes are spawned at any of four positions, A: (X: 10, Y: 10), B: (X: 10, Y: 30), C: (X: 30, Y: 10), D: (X: 30, Y: 30). The agent can be spawned anywhere in the range (X: 15.0-20.0, Y: 0.5-15.0), provided an object is not already present at the same location (Fig. 2c).

• Two boxes hard

The agent has to move two boxes in order to bridge a larger gap and access the reward. The two boxes are spawned at two fixed positions, A: (X: 10, Y: 10), B: (X: 10, Y: 30). A wall acts as a barrier in the middle of the gap, to prevent the agent from "surfing" a single box. The agent can be spawned anywhere in the range (X: 15.0-20.0, Y: 5.0-10.0), provided an object is not already present at the same location (Fig. 2d).
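The spawn rule shared by the four tasks, "spawned in the range ... if an object is not already present at the same location", can be illustrated with a simple rejection sampler. This helper is purely hypothetical; the actual environment handles spawning internally.

```python
import random

def sample_spawn(x_range, y_range, occupied, rng=random, max_tries=100):
    """Rejection-sample a spawn position inside the given X/Y range,
    retrying when the candidate coincides with an occupied cell.
    `occupied` is a collection of (x, y) integer cells, an assumption
    made for the sake of the sketch."""
    for _ in range(max_tries):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        if all((round(x), round(y)) != cell for cell in occupied):
            return x, y
    raise RuntimeError("no free spawn position found")
```

For the "One box easy" task this would be called with x_range = (0.5, 39.5) and y_range = (0.5, 39.5), with the box positions passed as the occupied cells.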

5.1. COMPARISON WITH BASELINES AND GENERALIZATION

In Figure 3 we compare PPO+D with parameters ρ = 0.1, φ = 0.3 to the behavioral cloning baselines (with 100 and 10 demonstrations), to GAIL (with 100 demonstrations) and to PPO+BC. PPO+BC combines PPO and BC in a similar way to PPO+D: with probability ρ a sample is drawn from the demonstrations and the policy is updated using the BC loss. The value loss function remains unchanged during the BC update. We test the GAIL implementation on a simple problem to verify that it is working properly (see Section C in the Appendix). For behavioral cloning we trained for 3000 learner steps (updates of the policy) with learning rate 10^-5. It is clear that PPO+D is able to outperform the baselines in all four problems. The performance of PPO+D varies considerably from task to task. This reflects the considerable difference in the range of the initial conditions for different tasks. In the tasks "One box easy" and "Two boxes hard" only the position of the agent sets different instances of the task apart. The initial positions of the boxes only play a role in the tasks "One box hard" and "Two boxes easy". Under closer inspection we could verify that in these two tasks the policy fails to generalize to configurations of boxes that are not seen in the demonstration, but does generalize well in all tasks to very different agent starting positions and orientations. This could be because there is a smooth transition between some of the initial conditions. Thanks to this, if the agent is able to generalize even only between initial conditions that are close, it will be able to progressively learn to perform well for all initial conditions starting from one demonstration. In other words, the agent is automatically using a curriculum learning strategy, starting from the initial conditions that are closest to the demonstration. This approach fails when there is an abrupt transition between initial conditions, such as between different box configurations.
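The BC update used in the PPO+BC baseline amounts to a negative log-likelihood of the demonstrator's actions under the current policy. A minimal sketch, with our own naming and assuming pre-computed per-step log-probabilities over the discrete action set:

```python
import numpy as np

def bc_loss(log_probs, demo_actions):
    """Behavioral-cloning loss: negative log-likelihood of the
    demonstrator's actions. `log_probs` has shape [T, n_actions];
    `demo_actions` holds the T action indices from the demonstration."""
    T = len(demo_actions)
    # Pick out the log-probability assigned to each demonstrated action.
    return -np.mean(log_probs[np.arange(T), demo_actions])
```

In PPO+BC this loss replaces the clipped surrogate on the demonstration batches, while environment batches still use the ordinary PPO update.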
During training we realized that the policy "cheated" in the task "Two boxes easy", as it managed to "surf" one of the boxes, in this way avoiding moving the remaining box (a similar behavior was reported in Baker et al. (2019)). Although this is in itself remarkable, we were interested in testing the agent for a more complex behavior, which it avoids by inventing the "surfing" behavior. To make sure it is capable of such more complex behavior we introduced "Two boxes hard". We decided to reduce the range of the initial conditions in this last task, as we had already verified that the agent can learn from such variable conditions in the tasks "One box hard" and "Two boxes easy". This last experiment only tests the agent for a more complex behaviour. In the tasks "One box hard" and "Two boxes easy" the agent could achieve higher performance given more training. We emphasise that PPO+D is designed to perform well on hard-exploration problems with stochastic environments and variable initial conditions; we tested it on the Atari environment "BreakoutNoFrameskip-v4" and conclude that it does not lead to better performance than vanilla PPO when the reward is dense. PPO+D also learns to complete the task, although more slowly than PPO.

5.2. THE ROLE OF φ AND THE VALUE-BUFFER D V

In Figure 4 we mainly compare PPO+D with ρ = 0.1, φ = 0.3 to ρ = 0.5, φ = 0.0. Crucially, the second parameter configuration does not use the buffer D_V. It is evident in the task "Two boxes easy" that D_V is essential for solving harder exploration problems. In the "One box easy" task we can see that, occasionally, on easier tasks not having D_V can result in faster learning. However, this comes with a greater variance in the training performance of the agent, as sometimes it completely fails to learn. In Figure 4 we also run vanilla PPO (ρ = 0.0, φ = 0.0) on the task "One box easy" and establish its inability to solve the task even once on any seed and any initial condition. This being the easiest of all tasks, we consider it unlikely that vanilla PPO would successfully complete any of the other, harder tasks. We defer an ablation study of the parameter ρ to Section B in the Appendix. Although we cannot make a direct comparison with Paine et al. (2019), we think it is useful to underline some of the differences between the two approaches, both in the problem setting and in the performance. We attempted to recreate the same complexity of the tasks on which Paine et al. (2019) was tested. In that article, variability is introduced in the tasks along multiple dimensions such as position and

6. CONCLUSION

We introduce PPO+D, an algorithm that uses a single demonstration to explore more efficiently in hard-exploration problems. We further improve the algorithm by introducing two replay buffers: one containing the agent's own successful trajectories, collected over the course of training, and the second collecting unsuccessful trajectories with a high maximum value estimate. In the second buffer the replay of the trajectories is prioritized according to the maximum estimated value. We show that training with both these buffers solves, to some degree, all tasks the agent was presented with. We also show that vanilla PPO, as well as PPO+D without the value buffer, fails to learn the same tasks. In the article, we justify the choice of such adjustments as measures to counter catastrophic forgetting, a problem that particularly afflicts PPO. The present algorithm suffers some limitations, as it currently fails to generalize to all variations of some of the problems, yet it manages to solve several very hard exploration problems with a single demonstration. We propose to address these limitations in future work.

A TRAINING DETAILS

For the training we used 14 parallel environments and computed the gradients using the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate of 10^-5. The agent perceives the environment through 84 by 84 RGB pixel observations in a stack of 4. At each time step the agent is allowed to take one of nine actions. We use the network architecture proposed in Kostrikov (2018), which includes a gated recurrent unit (GRU) (Cho et al., 2014) with a hidden layer of size 256. We ran the experiments on machines with 32 CPUs and 3 GPUs, model GeForce RTX 2080 Ti. The experiments were carried out with the following hyperparameters. For the training we performed no hyperparameter search over the replay ratios φ and ρ but set them to reasonable values. We found other configurations of these parameters to be sometimes more efficient in the training, such as setting ρ = 0.5 and φ = 0.0 in the task "One box easy". The parameters we ran all the experiments with were chosen because they allow solving all of the experiments with one demonstration.

Figure 6: Food collection task. In this task the agent is spawned into the arena with one green ball. The green food size and position are set randomly at the beginning of each episode. The episode ends when the green food is collected.

Our implementation of GAIL, based on Li et al. (2017), was trained with the following hyperparameters besides the PPO parameters in Table 1. In Table 3 we report the performance of both GAIL and behaviour cloning on the simple task. We note that although both methods achieve reasonable performance, the agent trained with GAIL reaches near-perfect performance, whereas the BC agent's performance tends to fluctuate significantly. The GAIL policy was trained for 120 million time frames, and the behavioral cloning policy for 3000 learner steps (updates of the policy) with learning rate 10^-5.
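The "stack of 4" 84 by 84 observations mentioned above can be produced with a standard frame-stacking wrapper. The sketch below is an assumption about the preprocessing, not the exact code from Kostrikov (2018); the class name and interface are our own.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last k observations stacked along the channel axis,
    matching the '84 by 84 RGB in a stack of 4' input described above."""
    def __init__(self, k=4, shape=(84, 84, 3)):
        self.k, self.shape = k, shape
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # On reset, repeat the first frame k times so the stack is full.
        for _ in range(self.k):
            self.frames.append(obs)
        return self._stacked()

    def step(self, obs):
        # The deque drops the oldest frame automatically.
        self.frames.append(obs)
        return self._stacked()

    def _stacked(self):
        return np.concatenate(list(self.frames), axis=-1)
```

With k = 4 RGB frames, the stacked observation has 12 channels, which would then be fed to the convolutional front end of the GRU policy network.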



A video showing the experimental results is available at https://www.youtube.com/playlist?list=PLBeSdcnDP2WFQWLBrLGSkwtitneOelcm-



The loss L^{PPO} is then optimized with respect to θ for K epochs with minibatch size M ≤ NT, the parameters are updated as θ ← θ - η∇_θ L^{PPO}, and the rollout storage is emptied, E ← {}.

3.1 SELF-IMITATION AND PRIORITIZED EXPERIENCE REPLAY

We hold the total size of the buffers |D| constant throughout the training. At the beginning of the training we only provide the agent with one demonstration trajectory, so |D_R| = 1 and |D_V| = |D| - 1, as |D| = |D_V| + |D_R|. D_V is only needed at the beginning, to collect the first successful trajectories. As |D_R| increases and we have more variety of trajectories in D_R, we decrease the probability φ of replaying trajectories from D_V. Once |D_R| is large enough, replaying the trajectories from D_V becomes more of a hindrance. The main reason for treating these two buffers separately is to slowly anneal from one to the other, avoiding this hindrance in the second phase. |D_V| and |D_R| are annealed each time a new successful trajectory is found and added to D_R.
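The annealing step applied each time a successful trajectory enters D_R can be written as a small update function. This is a sketch with our own names, following the annealing rule described earlier in the text: a fixed slice of probability mass φ_0/|D_V|_0 shifts from the value buffer to the reward buffer on every success.

```python
def anneal_on_success(rho, phi, size_v, size_r, phi0, size_v0):
    """One annealing step when a successful trajectory moves into D_R:
    rho grows and phi shrinks by phi0/|D_V|_0, the value buffer loses
    one slot and the reward buffer gains one."""
    step = phi0 / size_v0
    return rho + step, phi - step, size_v - 1, size_r + 1
```

Starting from ρ = 0.1, φ = 0.3 with |D_V|_0 = 50, each success shifts 0.006 of probability mass, so φ reaches zero after the 50 value-buffer slots have been consumed.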

Figure 1: Learner diagram. The PPO+D learner samples batches that are a mixture of demonstrations and the experiences the agent collects by interacting with the environment.

orientation of the agent, shape and color of objects, colors of floor and walls, number of objects in the environment and position of such objects. The range of the initial conditions for each of these factors was not reported. In our work we change the initial position and orientation of the agent as well as the initial positions of the boxes. As for the number of demonstrations, in Paine et al. (2019) the agent has access to a hundred demonstrations, compared to only one in our work. In the present article, the training time ranges from under 5 million frames to 500 million, whereas in Paine et al. (2019) it ranges from 5 billion to 30 billion. Although we adopted the use of the parameter ρ from Paine et al. (2019), its value differs considerably, which we attribute to the difference between the two algorithms: one being based on PPO, the other on DQN.

Figure 3: Experiments. Performance of behavioural cloning with ten and a hundred recorded human demonstrations, and of PPO+D with ρ = 0.1, φ = 0.3 and just one demonstration. The curves represent the mean, min and max performance for each of the baselines across 3 different seeds. The BC agent sporadically obtains some reward. GAIL with a hundred demonstrations never achieves any reward. PPO+BC only has access to one demonstration, like PPO+D. It occasionally solves the task but is unable to achieve high performance.

Figure 4: Experiments. Performance of vanilla PPO (ρ = 0.0, φ = 0.0), PPO+D with ρ = 0.5, φ = 0.0 and PPO+D with ρ = 0.1, φ = 0.3 on the tasks "One box easy" and "Two boxes easy" using a single demonstration. Some of the curves overlap each other as they receive zero or close to zero reward. Vanilla PPO never solves the task.

Model and PPO Hyperparameters

GAIL Hyperparameters

Performance on the "Food collection task"


Under review as a conference paper at ICLR 2021

In computing the probability of a trajectory being replayed, P(i) = p_i^α / Σ_k p_k^α, we set α = 10. The total buffer size is |D| = 51, with |D_V|_0 = 50 plus the human-generated trajectory. |D_R|_0 = 51, meaning that once the agent collects 50 successful trajectories, new successful trajectories overwrite old ones following a FIFO strategy and no trajectory is replayed from the value buffer. The implementation used is based on the repository Kostrikov (2018). On our infrastructure we could train at a speed of approximately 1.3 million frames per hour. The code, pre-trained models, the data set of arenas used for training, as well as video clips of the agent performing the tasks, are available at https://doi.org/10.6084/m9.figshare.12459602.v1.

B HYPERPARAMETERS ABLATION

In this section we present the results of four different experiments on a variation of the "One box easy" task where the agent position does not change across episodes and is shared with the demonstration. We test on this variation because it is one of the simplest problems we can use to probe PPO+D's performance. We only perform the ablation study on ρ, because φ is harder to test: it is indispensable for solving difficult tasks, but it can slow down the performance on easy tasks. This being an easy task, the results obtained do not provide any insight into the effect of φ in harder problems (as shown in Figure 4). The following figure shows the performance of the PPO+D algorithm as the ρ parameter is changed with φ = 0. Interestingly, we observe that, among the values chosen, the performance peaks at ρ = 0.3. We hypothesize that lower ρ values perform worse because the interval between demonstration replays is so large that it allows the network to forget the optimal policy learned from the demonstrations. On the other hand, higher values of ρ are even more counterproductive, as they prevent the agent from learning from its own experience, most critically learning what not to do.

C GAIL TEST

To verify the correctness of the GAIL implementation used for the experiments in Figure 4, we test it on a simple task in the Animal-AI environment. The task is shown in Figure 6; it consists of collecting green food of random size and position. The value-buffer experience replay creates an incremental curriculum for the agent to learn: keeping different trajectories that achieved a high maximum value in the buffer incentivizes the agent to combine these different sub-behaviours, e.g. pushing the blocks and going up the ramp.

